Text Mining Reddit Data with R

Description: This meetup is for anyone interested in learning and sharing knowledge about analyzing Reddit data using R. In this tutorial, we will use RedditExtractoR and a few other R packages to analyze a dataset of Reddit posts.

Text mining is the process of analyzing large collections of unstructured text data to discover patterns, trends, and insights. With the rise of social media platforms like Reddit, there is a wealth of information available in the form of user-generated content that can be analyzed using text mining techniques.

R is a popular programming language and environment for statistical computing and graphics, widely used in data analysis and data visualization. In recent years, it has also become a powerful tool for text mining and natural language processing.

In this Meetup event, we will explore how to use R for text mining of Reddit data. We will walk through the process of collecting data from Reddit using its API, cleaning and preprocessing the data, and applying text mining techniques such as sentiment analysis and topic modeling. By the end of the session, you will have a basic understanding of how to use R for text mining of social media data and be able to apply these techniques to other similar datasets.
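To make the collection step concrete, here is a minimal sketch using RedditExtractoR's find_thread_urls(). The subreddit (r/teslainvestorsclub) is the one that appears in the sample output later in this document; the output file name is an assumption for this example.

```r
library(RedditExtractoR)

# Pull the top threads from r/teslainvestorsclub over the past month
threads <- find_thread_urls(subreddit = "teslainvestorsclub",
                            sort_by = "top", period = "month")

# One row per thread: date_utc, timestamp, title, text, subreddit, comments, url
str(threads)

# Cache the result locally so we do not hit the Reddit API on every run
write.csv(threads, "reddit_data.csv", row.names = FALSE)
```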

Who should attend?

This meetup is open to all skill levels.

Requirements: Participants should have their laptops ready for this online event. Basic knowledge of R programming is recommended, but not required. Internet access will be required to reach the Reddit API during the live coding session.

Processing data

Using a few R packages, we will clean and preprocess the data to prepare it for analysis. We will remove stop words, punctuation, and URLs from the text data.
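The code for this step is not preserved in this document; the following is a minimal sketch using tm and SnowballC, assuming the collected posts were cached in reddit_data.csv as in the collection sketch above.

```r
library(readr)
library(tm)
library(SnowballC)

# Load the posts collected earlier
reddit_data <- read_csv("reddit_data.csv", show_col_types = FALSE)

# Build a corpus from the post titles
corpus <- VCorpus(VectorSource(reddit_data$title))

# Strip URLs first, while their punctuation is still intact
remove_urls <- content_transformer(function(x) gsub("http\\S+", "", x))
corpus <- tm_map(corpus, remove_urls)

# Lowercase, then drop punctuation, numbers, and English stop words
# (tm_map forwards extra arguments to the transformation itself,
#  so avoid passing options such as mc.cores here)
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removeWords, stopwords("english"))
corpus <- tm_map(corpus, stripWhitespace)

# Stem words to their root form
corpus <- tm_map(corpus, stemDocument)
```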

This creates a corpus of the post titles and removes punctuation, URLs, and stop words. We also apply stemming to reduce words to their root form.

Creating a Document-Term Matrix

We will now create a document-term matrix to represent the text data, with one row per post title and one column per term.
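Continuing from the cleaned corpus above, a sketch using tm's DocumentTermMatrix():

```r
# One row per post title, one column per term
dtm <- DocumentTermMatrix(corpus)
dtm

# Peek at one corner of the matrix
inspect(dtm[1:5, 1:5])
```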

Text Analysis using tm and other packages

We can now perform text analysis using tm and other packages. We will start by creating a few plots, such as a word cloud and a bar chart of term frequencies, to visualize the most frequent words in the post titles.
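The original code for this section is not preserved, only its output (shown below). Here is a sketch that would produce broadly similar output, continuing from the reddit_data data frame and the dtm matrix built above; the plot parameters are illustrative.

```r
library(wordcloud)
library(RColorBrewer)
library(ggplot2)

# Re-inspect the collected data before analyzing it
str(reddit_data)
nrow(reddit_data)
ncol(reddit_data)
head(reddit_data$title)

# Word frequencies from the document-term matrix
freq <- sort(colSums(as.matrix(dtm)), decreasing = TRUE)
word_freq <- data.frame(word = names(freq), freq = freq)
head(word_freq, 30)

# Word cloud of the most frequent title words
set.seed(1234)
wordcloud(words = word_freq$word, freq = word_freq$freq,
          min.freq = 3, max.words = 100, random.order = FALSE,
          colors = brewer.pal(8, "Dark2"))

# Bar chart of the top 15 terms
ggplot(head(word_freq, 15), aes(x = reorder(word, freq), y = freq)) +
  geom_col() +
  coord_flip() +
  labs(x = NULL, y = "Frequency",
       title = "Most frequent words in post titles")
```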

## spc_tbl_ [152 × 7] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ date_utc : Date[1:152], format: "2023-04-01" "2023-04-11" ...
##  $ timestamp: num [1:152] 1.68e+09 1.68e+09 1.68e+09 1.68e+09 1.68e+09 ...
##  $ title    : chr [1:152] "My Last \"Full Self Driving\" Video | AI DRIVR" "Week 62 update for #FSDBeta Community Tracker" "Daily Thread - April 11, 2023" "Daily Thread - March 31, 2023" ...
##  $ text     : chr [1:152] NA NA "All topics are permitted in this thread. If you are new here (or even if you're not), please skim through our ["| __truncated__ "All topics are permitted in this thread. If you are new here (or even if you're not), please skim through our ["| __truncated__ ...
##  $ subreddit: chr [1:152] "teslainvestorsclub" "teslainvestorsclub" "teslainvestorsclub" "teslainvestorsclub" ...
##  $ comments : num [1:152] 6 6 61 130 17 26 175 0 75 8 ...
##  $ url      : chr [1:152] "https://www.reddit.com/r/teslainvestorsclub/comments/128qgoh/my_last_full_self_driving_video_ai_drivr/" "https://www.reddit.com/r/teslainvestorsclub/comments/12iejks/week_62_update_for_fsdbeta_community_tracker/" "https://www.reddit.com/r/teslainvestorsclub/comments/12ibm24/daily_thread_april_11_2023/" "https://www.reddit.com/r/teslainvestorsclub/comments/127cyg1/daily_thread_march_31_2023/" ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   date_utc = col_date(format = ""),
##   ..   timestamp = col_double(),
##   ..   title = col_character(),
##   ..   text = col_character(),
##   ..   subreddit = col_character(),
##   ..   comments = col_double(),
##   ..   url = col_character()
##   .. )
##  - attr(*, "problems")=<externalptr>
## [1] 152
## [1] 7
##  chr [1:152] "My Last \"Full Self Driving\" Video | AI DRIVR" ...
## [1] "My Last \"Full Self Driving\" Video | AI DRIVR"                   
## [2] "Week 62 update for #FSDBeta Community Tracker"                    
## [3] "Daily Thread - April 11, 2023"                                    
## [4] "Daily Thread - March 31, 2023"                                    
## [5] "Long-Term Shareholder Returns: Evidence from 64,000 Global Stocks"
## [6] "Jim Cramer really doesn't get it AKA the Cybertruck Lambo"
## [1] "Daily Thread - April 20, 2023"                  
## [2] "Daily Thread - April 03, 2023"                  
## [3] "China insurance data Week March 20 - 26"        
## [4] "Front underview of Cybertruck during crash test"
## [5] "Daily Thread - April 19, 2023"                  
## [6] "Sodium Ion Batteries for Vehicles // Analysis"
## <<VCorpus>>
## Metadata:  corpus specific: 0, document level (indexed): 0
## Content:  documents: 152
## $content
## [1] "Daily Thread - April 21, 2023"
##                  word freq
## tesla           tesla   48
## daily           daily   31
## thread         thread   30
## april           april   26
## tsla             tsla   12
## china           china   11
## price           price   11
## teslas         teslas   11
## new               new   10
## fsdbeta       fsdbeta    9
## model           model    9
## week             week    9
## car               car    8
## cuts             cuts    8
## earnings     earnings    8
## march           march    8
## sales           sales    8
## elon             elon    7
## fsd               fsd    7
## musk             musk    7
## will             will    7
## year             year    7
## demand         demand    6
## update         update    6
## community   community    5
## cybertruck cybertruck    5
## deliveries deliveries    5
## drive           drive    5
## growth         growth    5
## says             says    5
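The session outline above also mentions sentiment analysis. As a minimal sketch, the syuzhet package can score each title against the NRC emotion lexicon; the lexicon choice here is an assumption, and the session may use a different method.

```r
library(syuzhet)

# Emotion scores per title using the NRC lexicon
nrc <- get_nrc_sentiment(reddit_data$title)

# Total count of each emotion across all titles
colSums(nrc)

# Overall polarity per title using syuzhet's default lexicon
polarity <- get_sentiment(reddit_data$title, method = "syuzhet")
summary(polarity)
```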

References

‘RedditExtractoR’ - an R package for accessing the Reddit API: https://github.com/ivan-rivera/RedditExtractor

What Are APIs? - Simply Explained: https://www.youtube.com/watch?v=OVvTv9Hy91Q