Description: This meetup is for anyone interested in learning and sharing knowledge about analyzing Reddit data using R. In this tutorial, we will be using RedditExtractoR and a few other R packages to analyze a dataset of Reddit posts.
Text mining is the process of analyzing large collections of unstructured text data to discover patterns, trends, and insights. With the rise of social media platforms like Reddit, there is a wealth of information available in the form of user-generated content that can be analyzed using text mining techniques.
R is a popular programming language and environment for statistical computing and graphics, widely used in data analysis and data visualization. In recent years, it has also become a powerful tool for text mining and natural language processing.
In this Meetup event, we will explore how to use R for text mining of Reddit data. We will walk through the process of collecting data from Reddit using its API, cleaning and preprocessing the data, and applying text mining techniques such as sentiment analysis and topic modeling. By the end of the session, you will have a basic understanding of how to use R for text mining of social media data and be able to apply these techniques to other similar datasets.
This meetup is open to all skill levels.
Requirements: Participants should have a laptop available for this online event. Basic knowledge of R programming is recommended but not required. Internet access will be required to access Reddit data during the live coding session.
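The data collection step can be sketched as follows. This is a minimal example, assuming RedditExtractoR version 3 or later, where `find_thread_urls()` returns one row per thread. The wrapper name `fetch_subreddit_threads`, its default arguments, and the CSV file name are hypothetical and chosen only for illustration; the subreddit name is taken from the output shown later in this document.

```r
# Sketch: fetch recent threads from one subreddit with RedditExtractoR.
# The wrapper name and its defaults are illustrative assumptions.
fetch_subreddit_threads <- function(subreddit = "teslainvestorsclub",
                                    sort_by = "new",
                                    period = "month") {
  if (!requireNamespace("RedditExtractoR", quietly = TRUE)) {
    stop("Install RedditExtractoR first: install.packages('RedditExtractoR')")
  }
  # Returns a data frame with one row per thread (title, text, url, ...)
  RedditExtractoR::find_thread_urls(
    subreddit = subreddit,
    sort_by   = sort_by,
    period    = period
  )
}

# The calls below need internet access, so they are left commented out.
# posts <- fetch_subreddit_threads()
# str(posts)
# write.csv(posts, "reddit_posts.csv", row.names = FALSE)  # hypothetical file name
```

Saving the result to a file means the rest of the analysis can be rerun without querying Reddit again.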
Using a few R packages, we will clean and preprocess the data to prepare it for analysis. We will remove stop words, punctuation, and URLs from the text data.
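A minimal sketch of these cleaning steps with the tm package is shown here. The sample titles are copied from the output later in this document; the URL regular expression, the lower-casing step, and the order of the transformations are assumptions rather than the exact code used in the session.

```r
library(tm)

# A few post titles taken from the output shown later in this document
titles <- c(
  "My Last \"Full Self Driving\" Video | AI DRIVR",
  "Week 62 update for #FSDBeta Community Tracker",
  "Daily Thread - April 11, 2023",
  "Jim Cramer really doesn't get it AKA the Cybertruck Lambo"
)

# Custom transformer that strips URLs (the pattern is an assumption)
remove_urls <- content_transformer(function(x) gsub("http\\S+|www\\.\\S+", "", x))

corpus <- VCorpus(VectorSource(titles))
corpus <- tm_map(corpus, remove_urls)
# Lower-case first so that capitalised stop words such as "My" are matched
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removeWords, stopwords("english"))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, stripWhitespace)

# Stemming relies on the SnowballC package, so it is applied only if available
if (requireNamespace("SnowballC", quietly = TRUE)) {
  corpus <- tm_map(corpus, stemDocument)
}

# Inspect one cleaned title
content(corpus[[3]])
```

Note that `tm_map()` forwards any extra arguments to the transformation function. Passing an argument such as `mc.cores = 1`, as some older tutorials do, is therefore a plausible cause of an "unused argument" error in current versions of tm.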
This step creates a corpus of the post titles and removes punctuation, URLs, and stop words. We also perform stemming to reduce words to their root form.
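We will now create a document-term matrix to represent the text data. In this matrix, each row is a document (here, a post title), each column is a term, and each cell holds the number of times that term appears in that document.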
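As an illustration, the sketch below builds a document-term matrix from three short, hand-cleaned strings based on titles in this dataset. The variable names are assumptions; `DocumentTermMatrix()` and `inspect()` are part of tm.

```r
library(tm)

# Hand-cleaned versions of three titles from this dataset (lower case, no punctuation)
clean_titles <- c(
  "daily thread april",
  "daily thread march",
  "china insurance data week march"
)

corpus <- VCorpus(VectorSource(clean_titles))

# Rows are documents, columns are terms, cells are term counts
dtm <- DocumentTermMatrix(corpus)

dim(dtm)      # number of documents and number of distinct terms
inspect(dtm)  # prints a summary and a sample of the matrix
```

By default, tm keeps only terms of at least three characters, so very short tokens would not appear as columns.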
We can now perform text analysis using tm and other packages. We will start by creating a few plots (word cloud, etc.) to visualize the most frequent words in the post titles.
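One way to obtain the word frequencies and plots is sketched below, again on three hand-cleaned strings so that the example runs on its own. The plotting choices (seed, palette, number of bars) are assumptions; `wordcloud()` comes from the wordcloud package and is called only if that package is installed.

```r
library(tm)

# Hand-cleaned versions of three titles from this dataset
clean_titles <- c(
  "daily thread april",
  "daily thread march",
  "china insurance data week march"
)
dtm <- DocumentTermMatrix(VCorpus(VectorSource(clean_titles)))

# Sum each term's column to get its total frequency, most frequent first
freq <- sort(colSums(as.matrix(dtm)), decreasing = TRUE)
word_freq <- data.frame(word = names(freq), freq = freq)
head(word_freq)

# Bar plot of the most frequent terms (base graphics)
barplot(head(freq, 10), las = 2, main = "Most frequent terms")

# Word cloud, drawn only if the wordcloud package is available
if (requireNamespace("wordcloud", quietly = TRUE)) {
  set.seed(123)  # word placement is random; a seed makes the layout repeatable
  wordcloud::wordcloud(
    words = word_freq$word,
    freq = word_freq$freq,
    min.freq = 1,
    colors = RColorBrewer::brewer.pal(8, "Dark2")
  )
}
```

The `word_freq` data frame has the same two-column layout (`word`, `freq`) as the frequency table in the output below.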
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.3'
## (as 'lib' is unspecified)
## Loading required package: NLP
## Loading required package: RColorBrewer
##
## Attaching package: 'syuzhet'
## The following object is masked from 'package:rtweet':
##
## get_tokens
##
## Attaching package: 'ggplot2'
## The following object is masked from 'package:NLP':
##
## annotate
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
## Rows: 152 Columns: 7
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (4): title, text, subreddit, url
## dbl (2): timestamp, comments
## date (1): date_utc
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## spc_tbl_ [152 × 7] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ date_utc : Date[1:152], format: "2023-04-01" "2023-04-11" ...
## $ timestamp: num [1:152] 1.68e+09 1.68e+09 1.68e+09 1.68e+09 1.68e+09 ...
## $ title : chr [1:152] "My Last \"Full Self Driving\" Video | AI DRIVR" "Week 62 update for #FSDBeta Community Tracker" "Daily Thread - April 11, 2023" "Daily Thread - March 31, 2023" ...
## $ text : chr [1:152] NA NA "All topics are permitted in this thread. If you are new here (or even if you're not), please skim through our ["| __truncated__ "All topics are permitted in this thread. If you are new here (or even if you're not), please skim through our ["| __truncated__ ...
## $ subreddit: chr [1:152] "teslainvestorsclub" "teslainvestorsclub" "teslainvestorsclub" "teslainvestorsclub" ...
## $ comments : num [1:152] 6 6 61 130 17 26 175 0 75 8 ...
## $ url : chr [1:152] "https://www.reddit.com/r/teslainvestorsclub/comments/128qgoh/my_last_full_self_driving_video_ai_drivr/" "https://www.reddit.com/r/teslainvestorsclub/comments/12iejks/week_62_update_for_fsdbeta_community_tracker/" "https://www.reddit.com/r/teslainvestorsclub/comments/12ibm24/daily_thread_april_11_2023/" "https://www.reddit.com/r/teslainvestorsclub/comments/127cyg1/daily_thread_march_31_2023/" ...
## - attr(*, "spec")=
## .. cols(
## .. date_utc = col_date(format = ""),
## .. timestamp = col_double(),
## .. title = col_character(),
## .. text = col_character(),
## .. subreddit = col_character(),
## .. comments = col_double(),
## .. url = col_character()
## .. )
## - attr(*, "problems")=<externalptr>
## [1] 152
## [1] 7
## chr [1:152] "My Last \"Full Self Driving\" Video | AI DRIVR" ...
## [1] "My Last \"Full Self Driving\" Video | AI DRIVR"
## [2] "Week 62 update for #FSDBeta Community Tracker"
## [3] "Daily Thread - April 11, 2023"
## [4] "Daily Thread - March 31, 2023"
## [5] "Long-Term Shareholder Returns: Evidence from 64,000 Global Stocks"
## [6] "Jim Cramer really doesn't get it AKA the Cybertruck Lambo"
## [1] "Daily Thread - April 20, 2023"
## [2] "Daily Thread - April 03, 2023"
## [3] "China insurance data Week March 20 - 26"
## [4] "Front underview of Cybertruck during crash test"
## [5] "Daily Thread - April 19, 2023"
## [6] "Sodium Ion Batteries for Vehicles // Analysis"
## <<VCorpus>>
## Metadata: corpus specific: 0, document level (indexed): 0
## Content: documents: 152
## $content
## [1] "Daily Thread - April 21, 2023"
## Error in FUN(X[[i]], ...): unused argument (mc.cores = 1)
## word freq
## tesla tesla 48
## daily daily 31
## thread thread 30
## april april 26
## tsla tsla 12
## china china 11
## price price 11
## teslas teslas 11
## new new 10
## fsdbeta fsdbeta 9
## model model 9
## week week 9
## car car 8
## cuts cuts 8
## earnings earnings 8
## march march 8
## sales sales 8
## elon elon 7
## fsd fsd 7
## musk musk 7
## will will 7
## year year 7
## demand demand 6
## update update 6
## community community 5
## cybertruck cybertruck 5
## deliveries deliveries 5
## drive drive 5
## growth growth 5
## says says 5
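The introduction mentions sentiment analysis, and syuzhet is among the packages loaded above. A minimal sketch of scoring post titles with `get_sentiment()` is given here; the choice of the `"syuzhet"` lexicon and the variable names are assumptions, and the titles are copied from the output above.

```r
library(syuzhet)

# A few post titles taken from the output above
titles <- c(
  "Jim Cramer really doesn't get it AKA the Cybertruck Lambo",
  "Front underview of Cybertruck during crash test",
  "Long-Term Shareholder Returns: Evidence from 64,000 Global Stocks"
)

# One numeric score per title: above zero leans positive, below zero leans negative
scores <- get_sentiment(titles, method = "syuzhet")

sentiment_tbl <- data.frame(title = titles, sentiment = scores)
sentiment_tbl
```

Titles are short, so individual scores are noisy; averaging scores by day or by week is likely to give a more stable picture.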
‘RedditExtractoR’ - An R Package that helps you access the Reddit API: https://github.com/ivan-rivera/RedditExtractor
What Are APIs? - Simply Explained: https://www.youtube.com/watch?v=OVvTv9Hy91Q