TM Lab 1: Text Mining Basics

class: clear, title-slide, inverse, center, top, middle

# TM Lab 1: Text Mining Basics
----
### **Dr. Shiyan Jiang**
### January 30, 2023

---
# Agenda

.pull-left[
## Part 1: Research Overview
- Research question
- Word count
- Term frequency
- Inverse document frequency
- TF-IDF
]

.pull-right[

## Part 2: R Code-Along
- Tokenization
- Stemming
- Stopword
- Filter
]

---
class: clear, inverse, middle, center

# Part 1: Overview

Turn texts into numbers
---
# Research questions

.panelset[

.panel[.panel-name[Walkthrough example]

.pull-left[
What aspects of online professional development offerings do teachers find most valuable?
]

.pull-right[

|Resource...6                                                                          |Role    |
|:-------------------------------------------------------------------------------------|:-------|
|Online Learning Module (e.g. Call for Change, Understanding the Standards, NC Falcon) |Teacher |
|NA                                                                                    |NA      |
|Online Learning Module (e.g. Call for Change, Understanding the Standards, NC Falcon) |Teacher |
|NA                                                                                    |NA      |
|NA                                                                                    |NA      |
|NA                                                                                    |NA      |

]

]

.panel[.panel-name[Discuss]

Take a look at the dataset located [here](https://github.com/laser-institute/text-mining/tree/main/dataset) and consider the following:
- In what format is this dataset stored?
- What do you notice about this dataset?
- What questions do you have about this dataset?
- What similar datasets do you have?
- What research questions do you want to address with your dataset?

]

]

---

# Word count

- Review 1: This movie is very scary and long
- Review 2: This movie is not scary and is slow
- Review 3: This movie is spooky and good

.center[
<img src="img/wordcount.png" height="300px"/>
]

.footnote[
Figure source: https://www.analyticsvidhya.com/blog/2020/02/quick-introduction-bag-of-words-bow-tf-idf/
]
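---
# Word count: a sketch in R

A minimal base R sketch of counting words in the three reviews above. Splitting on single spaces is a simplifying assumption that works here only because the reviews contain no punctuation.

```r
# The three example reviews from the previous slide
reviews <- c(
  "This movie is very scary and long",
  "This movie is not scary and is slow",
  "This movie is spooky and good"
)

# Lowercase each review and split it on spaces to get word tokens
tokens <- strsplit(tolower(reviews), " ", fixed = TRUE)

# Count how often each word appears across all three reviews
word_counts <- sort(table(unlist(tokens)), decreasing = TRUE)
word_counts

# The vocabulary is the set of unique words
vocabulary <- sort(unique(unlist(tokens)))
length(vocabulary)
```

The three reviews share a vocabulary of 11 unique words, and "is" is the most frequent word with 4 occurrences.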

---
# Term frequency

### The numbers we fill the matrix with are simply the raw counts of the tokens in each document. This is called the term frequency (TF) approach.

.center[
<img src="img/termfrequency.png" height="300px"/>
]

.footnote[
Figure source: https://www.analyticsvidhya.com/blog/2020/02/quick-introduction-bag-of-words-bow-tf-idf/
]
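---
# Term frequency: a sketch in R

One way to build the matrix of raw counts for the three movie reviews, using base R only. The row names `review_1` to `review_3` are hypothetical labels chosen for this illustration.

```r
reviews <- c(
  review_1 = "This movie is very scary and long",
  review_2 = "This movie is not scary and is slow",
  review_3 = "This movie is spooky and good"
)

# Tokenize by lowercasing and splitting on spaces
tokens <- strsplit(tolower(reviews), " ", fixed = TRUE)
vocabulary <- unique(unlist(tokens))

# One row per review, one column per vocabulary word, filled with raw counts
tf_matrix <- t(vapply(
  tokens,
  function(words) as.integer(table(factor(words, levels = vocabulary))),
  integer(length(vocabulary))
))
dimnames(tf_matrix) <- list(names(reviews), vocabulary)
tf_matrix
```

Each row sums to the number of words in that review, and a word that is absent from a review gets a 0.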

---
# IDF, TF-IDF

### Inverse document frequency (IDF) measures how rare, and therefore how informative, a term is across the corpus. TF-IDF combines TF and IDF to measure how important a word is to a document in a collection (or corpus) of documents.

.center[
<img src="img/tfidf.png" height="300px"/>
]

.footnote[
Figure source: https://www.analyticsvidhya.com/blog/2020/02/quick-introduction-bag-of-words-bow-tf-idf/
]
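---
# TF-IDF: a sketch in R

A minimal base R sketch of TF-IDF for the three movie reviews. It assumes one common variant of the formulas: TF is a word's share of the words in a review, and IDF is the natural log of the number of reviews divided by the number of reviews containing the word. The figure on the previous slide may use a different log base, so its numbers can differ.

```r
reviews <- c(
  review_1 = "This movie is very scary and long",
  review_2 = "This movie is not scary and is slow",
  review_3 = "This movie is spooky and good"
)
tokens <- strsplit(tolower(reviews), " ", fixed = TRUE)
vocabulary <- unique(unlist(tokens))

# Raw counts: one row per review, one column per word
counts <- t(vapply(
  tokens,
  function(words) as.integer(table(factor(words, levels = vocabulary))),
  integer(length(vocabulary))
))
dimnames(counts) <- list(names(reviews), vocabulary)

# TF: each count divided by the number of words in that review
tf <- counts / rowSums(counts)

# IDF: log(number of reviews / number of reviews containing the word)
idf <- log(nrow(counts) / colSums(counts > 0))

# TF-IDF: multiply each word's column of TF by that word's IDF
tf_idf <- sweep(tf, 2, idf, `*`)
round(tf_idf, 3)
```

Words that appear in every review ("this", "movie", "is", "and") have an IDF of 0, so their TF-IDF is 0 everywhere. Words unique to one review, such as "spooky", score highest.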

---

class: clear, inverse, middle, center

# part_2(R, code_along)

Tokenization, Stemming, Stopword, and Filter

[Text Mining_Basics]

---
# Tokenization, Stemming, Stopword, and Filter

### Common functions for preprocessing text data:

- `unnest_tokens()`: tokenization
- `wordStem()`: stemming (lab 3)
- `anti_join(dataframe, stop_words)`: stopword removal
- `filter()`: filtering
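---
# Putting the steps together: a sketch

A minimal sketch of how the four functions above can be chained. It assumes the `dplyr`, `tidytext`, and `SnowballC` packages are installed. The movie reviews stand in for the lab dataset, and dropping the word "movie" is a hypothetical filtering choice made only for illustration.

```r
library(dplyr)
library(tidytext)
library(SnowballC)

# Toy data: the three movie reviews from Part 1
reviews <- tibble(
  review = 1:3,
  text = c(
    "This movie is very scary and long",
    "This movie is not scary and is slow",
    "This movie is spooky and good"
  )
)

tidy_reviews <- reviews %>%
  unnest_tokens(word, text) %>%           # tokenization: one lowercase word per row
  anti_join(stop_words, by = "word") %>%  # stopword removal using tidytext's stop_words
  filter(word != "movie") %>%             # filter: drop a word we choose to exclude
  mutate(stem = wordStem(word))           # stemming (covered in lab 3)

# Count the remaining words, most frequent first
tidy_reviews %>%
  count(word, sort = TRUE)
```

The same pattern should carry over to the lab data by swapping in that data frame and its text column.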
---
class: clear, center

## .font130[.center[**Thank you!**]]
<br/>**Dr. Shiyan Jiang**<br/><mailto:sjiang24@ncsu.edu>