class: clear, title-slide, inverse, center, top, middle

# TM Lab 1: Text Mining Basics

----

### **Dr. Shiyan Jiang**

### January 30, 2023

---

# Agenda

.pull-left[
## Part 1: Research Overview
- Research question
- Word count
- Term frequency
- Inverse document frequency
- TF-IDF
]

.pull-right[
## Part 2: R Code-Along
- Tokenization
- Stemming
- Stopword
- Filter
]

---
class: clear, inverse, middle, center

# Part 1: Overview

Turn texts into numbers

---

# Research questions

.panelset[

.panel[.panel-name[Walkthrough example]

.pull-left[
What aspects of online professional development offerings do teachers find most valuable?
]

.pull-right[

|Resource...6 |Role |
|:-------------------------------------------------------------------------------------|:-------|
|Online Learning Module (e.g. Call for Change, Understanding the Standards, NC Falcon) |Teacher |
|NA |NA |
|Online Learning Module (e.g. Call for Change, Understanding the Standards, NC Falcon) |Teacher |
|NA |NA |
|NA |NA |
|NA |NA |

]

]

.panel[.panel-name[Discuss]

Take a look at the dataset located [here](https://github.com/laser-institute/text-mining/tree/main/dataset) and consider the following:

- In what format is this dataset stored?
- What are some things you notice about this dataset?
- What questions do you have about this dataset?
- What similar datasets do you have?
- What research questions do you want to address with your dataset?

]

]

---

# Word count

- Review 1: This movie is very scary and long
- Review 2: This movie is not scary and is slow
- Review 3: This movie is spooky and good

.center[
<img src="img/wordcount.png" height="300px"/>
]

.footnote[
Figure source: https://www.analyticsvidhya.com/blog/2020/02/quick-introduction-bag-of-words-bow-tf-idf/
]

---

# Term frequency

### The numbers we fill the matrix with are simply the raw counts of the tokens in each document. This is called the term frequency (TF) approach.
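
The counts for the three movie reviews can be reproduced with a minimal sketch like the one below. It assumes the `dplyr` and `tidytext` packages are installed; the object names (`reviews`, `term_freq`, `tf_matrix`) are illustrative and not part of the lab materials.

```r
library(dplyr)     # tibble(), count(), %>%
library(tidytext)  # unnest_tokens()

# The three example reviews from the word-count slide
reviews <- tibble(
  review = c("Review 1", "Review 2", "Review 3"),
  text = c(
    "This movie is very scary and long",
    "This movie is not scary and is slow",
    "This movie is spooky and good"
  )
)

# One row per token (lowercased by default), then raw counts per review
term_freq <- reviews %>%
  unnest_tokens(word, text) %>%
  count(review, word)

# Spread the counts into a review-by-term matrix; absent terms get 0
tf_matrix <- xtabs(n ~ review + word, data = term_freq)
tf_matrix
```

In this sketch, "is" occurs twice in Review 2, so that cell holds 2, while terms a review does not contain are filled with 0.
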

.center[
<img src="img/termfrequency.png" height="300px"/>
]

.footnote[
Figure source: https://www.analyticsvidhya.com/blog/2020/02/quick-introduction-bag-of-words-bow-tf-idf/
]

---

# IDF, TF-IDF

### Inverse document frequency (IDF) measures how rare a term is across the collection: the fewer documents a term appears in, the higher its IDF. TF-IDF combines TF and IDF to measure how important a word is to a document in a collection (or corpus) of documents.

.center[
<img src="img/tfidf.png" height="300px"/>
]

.footnote[
Figure source: https://www.analyticsvidhya.com/blog/2020/02/quick-introduction-bag-of-words-bow-tf-idf/
]

---
class: clear, inverse, middle, center

# Part 2: R Code-Along

Tokenization, Stemming, Stopword, and Filter

---

# Tokenization, Stemming, Stopword, and Filter

### These are some of the functions used to process text data in text mining:

- unnest_tokens()
- wordStem() (Lab 3)
- anti_join(dataframe, stop_words)
- filter()

---
class: clear, center

## .font130[.center[**Thank you!**]]

<br/>**Dr. Shiyan Jiang**<br/><mailto:sjiang24@ncsu.edu>