Description: This meetup is for anyone interested in learning and sharing knowledge about analyzing Reddit data using R. In this tutorial, we will be using RedditExtractoR and a few other R packages to analyze a dataset of Reddit posts.
Text mining is the process of analyzing large collections of unstructured text data to discover patterns, trends, and insights. With the rise of social media platforms like Reddit, there is a wealth of information available in the form of user-generated content that can be analyzed using text mining techniques.
R is a popular programming language and environment for statistical computing and graphics, widely used in data analysis and data visualization. In recent years, it has also become a powerful tool for text mining and natural language processing.
In this Meetup event, we will explore how to use R for text mining of Reddit data. We will walk through the process of collecting data from Reddit using its API, cleaning and preprocessing the data, and applying text mining techniques such as sentiment analysis and topic modeling. By the end of the session, you will have a basic understanding of how to use R for text mining of social media data and be able to apply these techniques to other similar datasets.
This meetup is open to all skill levels.
Requirements: Participants should have a laptop available for the online event. Basic knowledge of R programming is recommended but not required. Internet access will be required to retrieve Reddit data during the live coding session.
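The data collection step mentioned in the overview can be sketched with RedditExtractoR. This is a minimal sketch, assuming version 3.x of the package, where `find_thread_urls()` returns one row per thread. The helper name `fetch_threads` and the `sort_by` and `period` values are illustrative assumptions; the subreddit name is taken from the output shown further down this page.

```r
# Hypothetical helper (not part of RedditExtractoR): wraps
# RedditExtractoR::find_thread_urls() so the subreddit and time window
# are explicit. Calling it requires internet access.
fetch_threads <- function(subreddit, sort_by = "new", period = "month") {
  if (!requireNamespace("RedditExtractoR", quietly = TRUE)) {
    stop("Install RedditExtractoR first: install.packages('RedditExtractoR')")
  }
  RedditExtractoR::find_thread_urls(
    subreddit = subreddit,
    sort_by   = sort_by,
    period    = period
  )
}

# Example call (commented out because it queries Reddit):
# posts <- fetch_threads("teslainvestorsclub")
# str(posts)
```

Wrapping the call in a function keeps the network request out of the top level of the script, so the rest of the analysis can be rerun from a saved CSV without contacting Reddit again.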
Using a few R packages, we will clean and preprocess the data to prepare it for analysis. We will remove stop words, punctuation, and URLs from the text data.
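A cleaning pipeline along these lines can be sketched with the tm package. This is an illustration under stated assumptions, not necessarily the exact code used in the session: the `titles` vector is toy data standing in for the post titles, `remove_urls` is a hypothetical helper, and stemming is only applied if the SnowballC package is installed.

```r
library(tm)

# Toy titles standing in for the real post titles (illustrative data only)
titles <- c(
  "Tesla to halt some China production for upgrades",
  "Model S and X price reduction in USA https://example.com/post"
)

# Hypothetical helper: drop anything that looks like a URL
remove_urls <- content_transformer(function(x) gsub("http\\S+|www\\.\\S+", "", x))

corpus <- VCorpus(VectorSource(titles))
corpus <- tm_map(corpus, remove_urls)                    # before punctuation removal
corpus <- tm_map(corpus, content_transformer(tolower))   # before stop word removal
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeWords, stopwords("english"))
if (requireNamespace("SnowballC", quietly = TRUE)) {
  corpus <- tm_map(corpus, stemDocument)                 # reduce words to their stems
}
corpus <- tm_map(corpus, stripWhitespace)

# Pull the cleaned text back out as a plain character vector
cleaned <- vapply(corpus, function(doc) trimws(content(doc)), character(1))
print(cleaned)
```

The order matters: URLs are removed before punctuation so they are still recognizable, and text is lowercased before stop word removal because the stop word list is lowercase.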
This step creates a corpus of the post titles and removes punctuation, URLs, and stop words. We also apply stemming to reduce words to their root form.
We will now create a document term matrix to represent the text data.
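As a minimal sketch of what a document term matrix looks like, the example below builds one with tm from three toy, already-cleaned titles (the `docs` vector is illustrative data, not the real dataset). Each row is a document, each column is a term, and each cell holds the number of times that term occurs in that document.

```r
library(tm)

# Toy, already-cleaned titles (illustrative only)
docs <- c(
  "tesla halt china production",
  "tesla model price reduction",
  "daily thread"
)
corpus <- VCorpus(VectorSource(docs))

# Rows are documents, columns are terms, cells are term counts
dtm <- DocumentTermMatrix(corpus)

# Convert to a regular matrix for inspection; fine for small corpora
m <- as.matrix(dtm)
print(dim(m))
print(m[, "tesla"])
```

Note that `DocumentTermMatrix()` by default keeps only terms of at least three characters, so very short tokens will not appear as columns.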
We can now perform text analysis using tm and other packages. We will start by creating a few plots (word cloud, etc.) to visualize the most frequent words in the post titles.
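One way to get the most frequent words is to sum the columns of the document term matrix and sort the totals. The sketch below uses toy data (`docs` is illustrative, not the real dataset) and a base R bar plot; the word cloud call is guarded because it assumes the separate wordcloud package is installed.

```r
library(tm)

# Toy, already-cleaned titles (illustrative only)
docs <- c("tesla daily thread", "tesla model price", "tesla daily news")
dtm  <- DocumentTermMatrix(VCorpus(VectorSource(docs)))

# Sum each term's column to get its total count, then sort descending
freq <- sort(colSums(as.matrix(dtm)), decreasing = TRUE)
word_freq <- data.frame(word = names(freq), freq = freq)
print(head(word_freq, 5))

# Bar plot of term frequencies (base graphics)
barplot(word_freq$freq, names.arg = word_freq$word, las = 2, ylab = "Frequency")

# Optional word cloud, only if the wordcloud package is available
if (requireNamespace("wordcloud", quietly = TRUE)) {
  wordcloud::wordcloud(
    words = word_freq$word, freq = word_freq$freq,
    min.freq = 1, random.order = FALSE
  )
}
```

The resulting data frame has one row per term with `word` and `freq` columns, which is the same shape as the frequency listing printed further down this page.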
## Rows: 192 Columns: 7
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (4): title, text, subreddit, url
## dbl (2): timestamp, comments
## date (1): date_utc
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## spc_tbl_ [192 × 7] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ date_utc : Date[1:192], format: "2023-02-19" "2023-02-14" ...
## $ timestamp: num [1:192] 1.68e+09 1.68e+09 1.68e+09 1.68e+09 1.68e+09 ...
## $ title : chr [1:192] "The Limiting Factor -- The Tesla Semi: Why Now?" "See Tesla’s Berlin Factory In Detail, Including “Godzilla” The Robot" "Tesla is ‘setting the standard’ for the EV industry, says ARK’s Chief Futurist Brett Winston - Yahoo Finance" "FSDBeta 11.3.1 - Single Stack First Drive - Orlando FL (Chuck Cook)" ...
## $ text : chr [1:192] NA NA NA NA ...
## $ subreddit: chr [1:192] "teslainvestorsclub" "teslainvestorsclub" "teslainvestorsclub" "teslainvestorsclub" ...
## $ comments : num [1:192] 4 2 6 9 12 3 8 9 6 64 ...
## $ url : chr [1:192] "https://www.reddit.com/r/teslainvestorsclub/comments/116nhv1/the_limiting_factor_the_tesla_semi_why_now/" "https://www.reddit.com/r/teslainvestorsclub/comments/1125y27/see_teslas_berlin_factory_in_detail_including/" "https://www.reddit.com/r/teslainvestorsclub/comments/11gdjcd/tesla_is_setting_the_standard_for_the_ev_industry/" "https://www.reddit.com/r/teslainvestorsclub/comments/11lzdgt/fsdbeta_1131_single_stack_first_drive_orlando_fl/" ...
## - attr(*, "spec")=
## .. cols(
## .. date_utc = col_date(format = ""),
## .. timestamp = col_double(),
## .. title = col_character(),
## .. text = col_character(),
## .. subreddit = col_character(),
## .. comments = col_double(),
## .. url = col_character()
## .. )
## - attr(*, "problems")=<externalptr>
## [1] 192
## [1] 7
## chr [1:192] "The Limiting Factor -- The Tesla Semi: Why Now?" ...
## [1] "The Limiting Factor -- The Tesla Semi: Why Now?"
## [2] "See Tesla’s Berlin Factory In Detail, Including “Godzilla” The Robot"
## [3] "Tesla is ‘setting the standard’ for the EV industry, says ARK’s Chief Futurist Brett Winston - Yahoo Finance"
## [4] "FSDBeta 11.3.1 - Single Stack First Drive - Orlando FL (Chuck Cook)"
## [5] "Munro Live Q&A Panel post-investor day"
## [6] "Tesla Master Plan 3 + Investor Day // What to Expect - The Limiting Factor"
## [1] "Tesla Cybertruck is starting to look more refined with new black tonneau cover"
## [2] "Tesla to halt some China production for upgrades"
## [3] "Think Tesla Is Losing Popularity? Think Again"
## [4] "Daily Thread - March 01, 2023"
## [5] "Model S and X price reduction in USA"
## [6] "How Tesla could produce a car that costs 36.9% cheaper than a Toyota Camry. (And why it won't, yet.)"
## <<VCorpus>>
## Metadata: corpus specific: 0, document level (indexed): 0
## Content: documents: 192
## $content
## [1] "Expected utilization of massive cashflows."
## Error in FUN(X[[i]], ...): unused argument (mc.cores = 1)
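The error above is the message R gives when a function receives an argument it does not accept. The call that produced it is not shown on this page; a likely cause, stated here as an assumption, is that `mc.cores = 1` was passed to `tm_map()`, which forwards extra arguments to the transformation function. The base R sketch below reproduces the same message with `lapply()`; the `shout` function is a hypothetical stand-in for a text transformation.

```r
# Illustrative stand-in for a text transformation that takes only `x`
shout <- function(x) toupper(x)

# lapply() forwards extra arguments to the function, so an argument the
# function does not accept triggers an "unused argument" error
msg <- tryCatch(
  lapply(list("tesla"), shout, mc.cores = 1),
  error = function(e) conditionMessage(e)
)
print(msg)

# Dropping the extra argument lets the call succeed
ok <- lapply(list("tesla"), shout)
print(ok[[1]])
```

If this is indeed the cause, removing `mc.cores = 1` from the `tm_map()` call should let the affected cleaning step run.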
## word freq
## tesla tesla 78
## daily daily 30
## thread thread 29
## february february 23
## the the 19
## new new 15
## tsla tsla 14
## elon elon 13
## model model 12
## day day 11
## week week 11
## investor investor 10
## musk musk 10
## teslas teslas 10
## march march 9
## says says 9
## will will 9
## car car 8
## china china 7
## ford ford 7
## price price 7
## battery battery 6
## evs evs 6
## first first 6
## mexico mexico 6
## news news 6
## now now 6
## production production 6
## stock stock 6
## cars cars 5
‘RedditExtractoR’ - An R Package that helps you access the Reddit API: https://github.com/ivan-rivera/RedditExtractor
What Are APIs? - Simply Explained: https://www.youtube.com/watch?v=OVvTv9Hy91Q