Text Mining of Reddit Data Using R (or RStudio)

Description: This meetup is for anyone interested in learning and sharing knowledge about analyzing Reddit data using R. In this tutorial, we will be using RedditExtractoR and a few other R packages to analyze a dataset of Reddit posts.

Text mining is the process of analyzing large collections of unstructured text data to discover patterns, trends, and insights. With the rise of social media platforms like Reddit, there is a wealth of information available in the form of user-generated content that can be analyzed using text mining techniques.

R is a popular programming language and environment for statistical computing and graphics, widely used in data analysis and data visualization. In recent years, it has also become a powerful tool for text mining and natural language processing.

In this meetup, we will explore how to use R for text mining of Reddit data. We will walk through the process of collecting data from Reddit using its API, cleaning and preprocessing the data, and applying text mining techniques such as sentiment analysis and topic modeling. By the end of the session, you will have a basic understanding of how to use R for text mining of social media data and be able to apply these techniques to similar datasets.

Who should attend?

This meetup is open to all skill levels.

Requirements: Participants need a laptop to follow along during the online event. Basic knowledge of R programming is recommended but not required. Internet access is required to retrieve Reddit data during the live coding session.
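
Before any processing, the posts have to be collected. The following is a minimal sketch of that step with RedditExtractoR, not necessarily the exact call used to build the dataset in this tutorial. The wrapper name `fetch_threads`, the chosen `sort_by` and `period` values, and the CSV file name are assumptions made for illustration.

```r
# Sketch only: needs the RedditExtractoR package and internet access.
# install.packages("RedditExtractoR")

# Hypothetical helper that wraps RedditExtractoR::find_thread_urls().
fetch_threads <- function(subreddit, period = "month", sort_by = "new") {
  # find_thread_urls() returns one row per thread, with columns such as
  # date_utc, timestamp, title, text, subreddit, comments, and url.
  RedditExtractoR::find_thread_urls(
    subreddit = subreddit,
    sort_by   = sort_by,
    period    = period
  )
}

# Example call (commented out because it queries Reddit):
# posts <- fetch_threads("teslainvestorsclub")
# write.csv(posts, "reddit_posts.csv", row.names = FALSE)  # hypothetical file name
```

Saving the result to a CSV file keeps the rest of the session reproducible, since Reddit returns different posts each time it is queried.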

Processing data

Using a few R packages, we will clean and preprocess the data to prepare it for analysis. We will remove stop words, punctuation, and URLs from the text data.

This step creates a corpus of the post titles and removes punctuation, URLs, and stop words. We also apply stemming to reduce words to their root form.
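
The cleaning steps described above can be sketched with the tm package as follows. This is an illustrative version rather than the exact code behind the output further down: the three toy titles, the `remove_urls` helper, and the order of the steps are assumptions. Stemming relies on the SnowballC package, so it is applied only when that package is available.

```r
library(tm)

# Toy titles standing in for the real post titles (illustrative only).
titles <- c(
  "Tesla to halt some China production for upgrades",
  "Model S and X price reduction in USA https://example.com/post",
  "Daily Thread - March 01, 2023"
)

# Hypothetical helper: delete anything that looks like a URL.
remove_urls <- content_transformer(function(x) gsub("http[^[:space:]]*", "", x))

corpus <- VCorpus(VectorSource(titles))
corpus <- tm_map(corpus, content_transformer(tolower))  # lower-case first so stop words match
corpus <- tm_map(corpus, remove_urls)                   # while the URLs are still intact
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeWords, stopwords("english"))
if (requireNamespace("SnowballC", quietly = TRUE)) {
  corpus <- tm_map(corpus, stemDocument)                # reduce words to their stems
}
corpus <- tm_map(corpus, stripWhitespace)

# Pull the cleaned text back out as a plain character vector.
cleaned <- trimws(unname(sapply(corpus, as.character)))
print(cleaned)
```

Lower-casing before stop-word removal matters because `stopwords("english")` is all lower case, so a capitalized "The" would otherwise survive the filter.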

Creating a Document Term Matrix

We will now create a document term matrix to represent the text data.
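
A document term matrix has one row per document and one column per term, with each cell holding the number of times the term occurs in the document. A minimal sketch with tm, using three toy titles that are assumed to be already cleaned:

```r
library(tm)

# Toy, already-cleaned titles (illustrative only).
docs <- c("tesla price cut", "tesla stock news", "daily thread")
corpus <- VCorpus(VectorSource(docs))

# Rows are documents, columns are terms, cells are term counts.
dtm <- DocumentTermMatrix(corpus)
inspect(dtm)

# Convert to an ordinary matrix to index by term name.
m <- as.matrix(dtm)
m[, "tesla"]  # count of "tesla" in each document
```

For a corpus of a few hundred titles the dense matrix from `as.matrix()` is fine; larger corpora are usually kept in the sparse form that `DocumentTermMatrix()` returns.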

Text Analysis using tm and other packages

We can now perform text analysis using tm and other packages. We will start by creating a few plots, such as a word cloud, to visualize the most frequent words in the post titles.
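
One way to get from a document term matrix to a frequency table and plots is sketched below. The toy titles and the `word_freq` name are assumptions, and the plotting choices (a base R bar chart plus an optional word cloud) are only one possibility. The word cloud is drawn only if the wordcloud package is installed.

```r
library(tm)

# Toy titles standing in for the cleaned post titles (illustrative only).
docs <- c("tesla price cut", "tesla stock news", "tesla daily thread", "daily thread")
dtm <- DocumentTermMatrix(VCorpus(VectorSource(docs)))

# Sum each column to get the total count per term, then sort descending.
freq <- sort(colSums(as.matrix(dtm)), decreasing = TRUE)
word_freq <- data.frame(
  word = names(freq),
  freq = as.integer(freq),
  row.names = names(freq),
  stringsAsFactors = FALSE
)
head(word_freq, 3)

# Bar chart of the most frequent terms (base graphics).
barplot(head(freq, 5), las = 2, ylab = "Frequency")

# Word cloud, drawn only when the wordcloud package is available.
if (requireNamespace("wordcloud", quietly = TRUE)) {
  wordcloud::wordcloud(
    words = word_freq$word, freq = word_freq$freq,
    min.freq = 1, random.order = FALSE,
    colors = RColorBrewer::brewer.pal(8, "Dark2")
  )
}
```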

## Loading required package: NLP

## Loading required package: RColorBrewer

## 
## Attaching package: 'syuzhet'

## The following object is masked from 'package:rtweet':
## 
##     get_tokens

## 
## Attaching package: 'ggplot2'

## The following object is masked from 'package:NLP':
## 
##     annotate

## 
## Attaching package: 'plotly'

## The following object is masked from 'package:ggplot2':
## 
##     last_plot

## The following object is masked from 'package:stats':
## 
##     filter

## The following object is masked from 'package:graphics':
## 
##     layout

## Rows: 192 Columns: 7
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (4): title, text, subreddit, url
## dbl  (2): timestamp, comments
## date (1): date_utc
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

## spc_tbl_ [192 × 7] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ date_utc : Date[1:192], format: "2023-02-19" "2023-02-14" ...
##  $ timestamp: num [1:192] 1.68e+09 1.68e+09 1.68e+09 1.68e+09 1.68e+09 ...
##  $ title    : chr [1:192] "The Limiting Factor -- The Tesla Semi: Why Now?" "See Tesla’s Berlin Factory In Detail, Including “Godzilla” The Robot" "Tesla is ‘setting the standard’ for the EV industry, says ARK’s Chief Futurist Brett Winston - Yahoo Finance" "FSDBeta 11.3.1 - Single Stack First Drive - Orlando FL (Chuck Cook)" ...
##  $ text     : chr [1:192] NA NA NA NA ...
##  $ subreddit: chr [1:192] "teslainvestorsclub" "teslainvestorsclub" "teslainvestorsclub" "teslainvestorsclub" ...
##  $ comments : num [1:192] 4 2 6 9 12 3 8 9 6 64 ...
##  $ url      : chr [1:192] "https://www.reddit.com/r/teslainvestorsclub/comments/116nhv1/the_limiting_factor_the_tesla_semi_why_now/" "https://www.reddit.com/r/teslainvestorsclub/comments/1125y27/see_teslas_berlin_factory_in_detail_including/" "https://www.reddit.com/r/teslainvestorsclub/comments/11gdjcd/tesla_is_setting_the_standard_for_the_ev_industry/" "https://www.reddit.com/r/teslainvestorsclub/comments/11lzdgt/fsdbeta_1131_single_stack_first_drive_orlando_fl/" ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   date_utc = col_date(format = ""),
##   ..   timestamp = col_double(),
##   ..   title = col_character(),
##   ..   text = col_character(),
##   ..   subreddit = col_character(),
##   ..   comments = col_double(),
##   ..   url = col_character()
##   .. )
##  - attr(*, "problems")=<externalptr>

## [1] 192

## [1] 7

##  chr [1:192] "The Limiting Factor -- The Tesla Semi: Why Now?" ...

## [1] "The Limiting Factor -- The Tesla Semi: Why Now?"                                                                      
## [2] "See Tesla’s Berlin Factory In Detail, Including “Godzilla” The Robot"
## [3] "Tesla is ‘setting the standard’ for the EV industry, says ARK’s Chief Futurist Brett Winston - Yahoo Finance"
## [4] "FSDBeta 11.3.1 - Single Stack First Drive - Orlando FL (Chuck Cook)"                                                  
## [5] "Munro Live Q&amp;A Panel post-investor day"                                                                           
## [6] "Tesla Master Plan 3 + Investor Day // What to Expect Ï=\v- The Limiting Factor"

## [1] "Tesla Cybertruck is starting to look more refined with new black tonneau cover"                      
## [2] "Tesla to halt some China production for upgrades"                                                    
## [3] "Think Tesla Is Losing Popularity? Think Again"                                                       
## [4] "Daily Thread - March 01, 2023"                                                                       
## [5] "Model S and X price reduction in USA"                                                                
## [6] "How Tesla could produce a car that costs 36.9% cheaper than a Toyota Camry. (And why it won't, yet.)"

## <<VCorpus>>
## Metadata:  corpus specific: 0, document level (indexed): 0
## Content:  documents: 192

## $content
## [1] "Expected utilization of massive cashflows."

## Error in FUN(X[[i]], ...): unused argument (mc.cores = 1)
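
An "unused argument (mc.cores = 1)" error like the one above is typically what recent versions of tm report when `mc.cores` is passed to `tm_map()`, because extra arguments are forwarded to the transformation function itself. The call that produced the error is not shown, so the following is only a sketch of the likely fix, using `stripWhitespace` as a stand-in for whichever transformation was involved.

```r
library(tm)

# One-document corpus with extra spaces (illustrative only).
corpus <- VCorpus(VectorSource("Expected   utilization of massive cashflows."))

# Pattern seen in older tutorials, expected to fail on recent tm versions
# because mc.cores is handed on to stripWhitespace():
# corpus <- tm_map(corpus, stripWhitespace, mc.cores = 1)

# Dropping the extra argument lets the transformation run.
corpus <- tm_map(corpus, stripWhitespace)
as.character(corpus[[1]])
```

The same change applies to any other `tm_map()` call that still carries `mc.cores`.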

##                  word freq
## tesla           tesla   78
## daily           daily   30
## thread         thread   29
## february     february   23
## the               the   19
## new               new   15
## tsla             tsla   14
## elon             elon   13
## model           model   12
## day               day   11
## week             week   11
## investor     investor   10
## musk             musk   10
## teslas         teslas   10
## march           march    9
## says             says    9
## will             will    9
## car               car    8
## china           china    7
## ford             ford    7
## price           price    7
## battery       battery    6
## evs               evs    6
## first           first    6
## mexico         mexico    6
## news             news    6
## now               now    6
## production production    6
## stock           stock    6
## cars             cars    5

References

‘RedditExtractoR’ - An R Package that helps you access the Reddit API: https://github.com/ivan-rivera/RedditExtractor

What Are APIs? - Simply Explained: https://www.youtube.com/watch?v=OVvTv9Hy91Q