2023-07-10

Class Plan

  • Data activity (10 min)
  • Intro to text mining (40 min)
  • Break (5 min)
  • Discuss Katrina, class readings (15 min)
  • Intro to ProQuest TDM Studio (20 min)
  • Problem set questions (Remainder)

Week 3 Groups!

print.data.frame(groups)
##                  group 1         group 2                 group 3
## 1 Cortez, Hugo Alexander   Cai, Qingyuan          Somyurek, Ecem
## 2  Widodo, Ignazio Marco    Gupta, Umang      Jun, Ernest Ng Wei
## 3  Leong, Wen Hou Lester Knutson, Blue C Spindler, Laine Addison
## 4        Gnanam, Akash Y Tan, Zheng Yang     Premkrishna, Shrish
##                            group 4       group 5              group 6
## 1        Saccone, Alexander Connor                Albertini, Federico
## 2 Ramos, Jessica Andria Potestades  Ng, Michelle         Shah, Jainam
## 3                                  Ning, Zhi Yan Dotson, Bianca Ciara
## 4          Alsayegh, Aisha E H M I     Su, Barry        Lim, Fang Jan
##                    group 7
## 1              Tian, Zerui
## 2         Wan Rosli, Nadia
## 3 Huynh Le Hue Tam, Vivian
## 4      Andrew Yu Ming Xin,

Data Activity

  • Task: communicate a message about the data using a visual approach of your choosing.
  • Be creative! Think about using the numeric and text information in different ways. Consider bringing in outside resources (e.g., maps).
  • Visualization does not need to be complete!

Data Visualization

Hurricane Charts

To Follow Along with Slides

What is TidyText?

What is Tidy Data?

  • Each observation is a row
  • Each variable a column
  • Each type of observational unit a table

What is Tidy Data?

Row Person Birthday Occupation
1 Joe 12/3/1963 Carpenter
2 Malik 6/8/1978 Architect
3 Suzanna 4/3/2001 Student

What is Tidy Data?

Row County Temperature PM2.5
1 Santa Clara 78.1 12.1
2 San Mateo 82.3 32.1
3 San Francisco 65.4 44.7
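
As a rough illustration of the three rules, the sketch below reshapes a small "wide" table into tidy form. The day columns (mon, tue) and their values are made up for this example and are not part of the county table above.

```r
library(dplyr)
library(tidyr)

# Hypothetical untidy table: the variable "day" is spread across column names
wide <- tibble(
  county = c("Santa Clara", "San Mateo"),
  mon = c(78.1, 82.3),
  tue = c(80.0, 79.5)
)

# Tidy version: one row per county-day observation, one column per variable
long <- wide %>%
  pivot_longer(cols = c(mon, tue),
               names_to = "day",
               values_to = "temperature")

long
```

Once each observation is its own row, functions such as filter(), group_by(), and count() apply directly.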

What is TidyText Data?

  • How should we organize data with text?
  • For example, newspaper articles

What is TidyText Data?

Row Paper Article Text
1 New York Times Study Compares Gas Stove Pollu… Using
2 New York Times Study Compares Gas Stove Pollu… a
3 New York Times Study Compares Gas Stove Pollu… single

What is TidyText Data?

Row Paper Article Text
1 New York Times Study Compares Gas Stove Pollution to Secondhand Cigarette Smoke Using a single gas-stove burner can raise indoor concentrations of benzene, …
2 New York Times Study Compares Gas Stove Pollution to Secondhand Cigarette Smoke For the peer-reviewed study, researchers at Stanford’s Doerr School of Sustainability …
3 New York Times Study Compares Gas Stove Pollution to Secondhand Cigarette Smoke In about a third of the homes, a single gas burner …

What is TidyText Data?

  • What about the Guardian news data that we pulled earlier - is this in tidy text format?
  • Read it into your R environment; if it is not, try to convert it into a tidy text table.
library(readr)
ca_wf <- read_csv("ca_wf.csv")

Refresher on Pipes

  • What does %>% do?
  • e.g. what would ca_wf %>% mutate(new_var = "") do?
  • What about %<>%?
  • Discuss in groups
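
For reference after the group discussion, here is a minimal sketch of the two pipes. The toy tibble is a hypothetical stand-in for ca_wf, and its column names are assumptions made for this example.

```r
library(dplyr)
library(magrittr)  # %<>% comes from magrittr; dplyr only re-exports %>%

# Hypothetical stand-in for ca_wf
toy <- tibble(
  type = c("liveblog", "article"),
  body_text = c("Smoke over the city", "Air quality alert")
)

# %>% passes the left-hand side as the first argument of the next function.
# The result is returned; toy itself is not modified.
with_new <- toy %>% mutate(new_var = "")

# %<>% pipes and then assigns the result back to the left-hand side,
# so this line is shorthand for: toy <- toy %>% filter(type == "liveblog")
toy %<>% filter(type == "liveblog")
```

Because %<>% overwrites its input, rerunning the same chunk can give different results; the plain `x <- x %>% ...` form makes the overwrite easier to spot.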

Working with TidyText Data

  • Let’s first limit our dataset to blogs using filter()
  • Are these data in tidytext format?
library(tidytext)
library(dplyr)
library(magrittr)  # provides the %<>% assignment pipe used on later slides

# first, set up liveblog dataframe
tidy_blogs <- ca_wf %>%
  filter(type == "liveblog")

Working with TidyText Data

  • Let’s say we want each row to be a word
# unnest tokens
tidy_blogs %<>%
  unnest_tokens(word, body_text) %>%
  anti_join(stop_words, by = "word")

Working with TidyText Data

  • A lot is going on here!
  • What is unnest_tokens() doing?
  • What about the anti_join(stop_words)?
# unnest tokens
tidy_blogs %<>%
  unnest_tokens(word, body_text) %>%
  anti_join(stop_words, by = "word")
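
To see the two steps separately, the sketch below runs them on a two-row toy table. The data and the names toy_blogs, toy_tokens, and toy_clean are invented for illustration.

```r
library(dplyr)
library(tidytext)

# Hypothetical two-document example
toy_blogs <- tibble(
  id = 1:2,
  body_text = c("Smoke shrouded the City.", "The air quality is poor!")
)

# Step 1: unnest_tokens() splits body_text into one word per row.
# By default it lowercases the words, strips punctuation, and drops
# the original body_text column; other columns (id) are kept.
toy_tokens <- toy_blogs %>%
  unnest_tokens(word, body_text)

# Step 2: anti_join() keeps only the rows whose word does NOT
# appear in the stop_words table.
toy_clean <- toy_tokens %>%
  anti_join(stop_words, by = "word")

toy_tokens
toy_clean
```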

Working with TidyText Data

  • What are stop_words? You can run View(stop_words) to look at these
  • Three different lexicons: SMART, snowball, and onix
  • Stop words are essentially words that are not useful for our analyses, such as “the”
  • Are there any surprising words there?
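
A quick way to explore the stop word table without View() is sketched below. The word "fire" is just an arbitrary example to look up.

```r
library(dplyr)
library(tidytext)

# How many stop words does each lexicon contribute?
stop_words %>%
  count(lexicon)

# Restrict to a single lexicon if the full list feels too aggressive
snowball_only <- stop_words %>%
  filter(lexicon == "snowball")

# Check whether specific words would be removed, and by which lexicon
stop_words %>%
  filter(word %in% c("the", "fire"))
```

The same word can appear under more than one lexicon. That is harmless for anti_join(), which only checks whether a match exists.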

Working with TidyText Data

  • What happens when we anti_join the stop_words?
  • Let’s take a closer look at joins
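
As a small sketch of the idea, compare inner_join() and anti_join() on toy tables. The tables words and stops are invented here, with stops playing the role of stop_words.

```r
library(dplyr)

words <- tibble(word = c("the", "smoke", "of", "wildfires"))
stops <- tibble(word = c("the", "of"), lexicon = "toy")

# inner_join() keeps rows of words that HAVE a match in stops
matched <- inner_join(words, stops, by = "word")

# anti_join() keeps rows of words that have NO match in stops,
# and never adds columns from the second table
kept <- anti_join(words, stops, by = "word")

matched
kept
```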

Working with TidyText Data

  • Let’s look at the result
# look at examples
tidy_blogs %>%
  select(type, word) %>%
  head()
## # A tibble: 6 × 2
##   type     word    
##   <chr>    <chr>   
## 1 liveblog 6pm     
## 2 liveblog york    
## 3 liveblog city    
## 4 liveblog skies   
## 5 liveblog shrouded
## 6 liveblog thick

Working with TidyText Data

  • Now that we have a tokenized dataset, new analyses become simple
  • For example, we can use the count() function to get word frequencies
# look at blog word frequencies
tidy_blogs %>%
  count(word, sort = TRUE)
## # A tibble: 2,379 × 2
##    word          n
##    <chr>     <int>
##  1 air         111
##  2 quality      69
##  3 smoke        68
##  4 wildfires    64
##  5 trump        59
##  6 pence        58
##  7 york         58
##  8 canada       55
##  9 president    55
## 10 city         46
## # ℹ 2,369 more rows

Working with TidyText Data

  • What if we repeat the prior steps for articles, not blogs, and save the result as tidy_articles?
# look at article frequencies
tidy_articles %>%
  count(word, sort = TRUE)
## # A tibble: 1,808 × 2
##    word          n
##    <chr>     <int>
##  1 air          96
##  2 smoke        58
##  3 quality      50
##  4 york         40
##  5 canada       37
##  6 wildfires    34
##  7 climate      33
##  8 fires        32
##  9 city         31
## 10 wednesday    30
## # ℹ 1,798 more rows
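
The slides do not show how tidy_articles was built. One plausible version, sketched on toy data, repeats the blog pipeline with a different filter. The toy_news table and the assumption that articles are labelled "article" in the type column are illustrative and should be checked against the real data.

```r
library(dplyr)
library(tidytext)

# Hypothetical stand-in for ca_wf
toy_news <- tibble(
  type = c("liveblog", "article", "article"),
  body_text = c("Smoke over New York",
                "Wildfire smoke drifts south",
                "Smoke and air quality alerts")
)

# Same steps as for blogs, but keeping articles instead
toy_articles <- toy_news %>%
  filter(type == "article") %>%          # assumed label for articles
  unnest_tokens(word, body_text) %>%
  anti_join(stop_words, by = "word")

# Word frequencies for articles only
article_counts <- toy_articles %>%
  count(word, sort = TRUE)

article_counts
```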

Working with TidyText Data

library(tidyr)
# share of each word within blogs and within articles, one column per type
frequency <- bind_rows(tidy_blogs,
                       tidy_articles) %>%
  count(type, word) %>%
  group_by(type) %>%
  mutate(proportion = n / sum(n)) %>%
  ungroup() %>%
  select(-n) %>%
  pivot_wider(names_from = type, values_from = proportion)
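
To make the output shape concrete, here is the same pipeline on two tiny, made-up token tables. The words and counts are invented for this sketch.

```r
library(dplyr)
library(tidyr)

# Hypothetical tokenized data, four words per type
toy_blogs <- tibble(type = "liveblog",
                    word = c("air", "air", "smoke", "trump"))
toy_articles <- tibble(type = "article",
                       word = c("air", "smoke", "smoke", "climate"))

toy_frequency <- bind_rows(toy_blogs, toy_articles) %>%
  count(type, word) %>%                     # n = occurrences per type and word
  group_by(type) %>%
  mutate(proportion = n / sum(n)) %>%       # share within each type
  ungroup() %>%
  select(-n) %>%
  pivot_wider(names_from = type, values_from = proportion)

toy_frequency
```

The result has one row per word and one proportion column per type. A word that never appears in one type gets NA in that column, which matters if the two columns are later plotted against each other.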

Hurricane Katrina

  • Costliest storm in U.S. history
  • An “unnatural disaster”

Hurricane Katrina

  • In groups, discuss:
  • What occurred during Hurricane Katrina, in terms of the natural disaster?
  • What elements were “unnatural”?

Hurricane Katrina

  • Failure of levees and city infrastructure

Hurricane Katrina

  • Failure of governmental and social responses

Hurricane Katrina

  • Uneven evacuation and return-migration

Hurricane Katrina

  • Ongoing disruption for those who were displaced

Class Plan

  • Data activity (10 min)
  • Intro to ProQuest TDM Studio (10 min)
  • Text mining congressional records (40 min)
  • Break (5 min)
  • Using ProQuest TDM Studio (20 min)
  • Problem set questions (Remainder)

Data Activity

  • Task: communicate a message about the data using a visual approach of your choosing.
  • Be creative! Think about using the numeric and text information in different ways. Consider bringing in outside resources (e.g., maps).

Hurricane Charts

Vermont Flooding

Gathering Data from ProQuest

  • What is ProQuest?
  • Repository of newspapers, academic articles, dissertations, government records, and more

Gathering Data from ProQuest

  • Pro: analysis runs in the cloud. This is helpful with large datasets because it frees up your local machine for other work.
  • Con: slow and finicky. This could change, but for now ProQuest TDM Studio is not fast and may freeze or crash on you.
  • Con: exporting results takes time. You can’t simply save outputs from ProQuest TDM Studio to your desktop; the ProQuest team must approve them first. Approval usually takes less than an hour but can take longer, so budget for it in your project timeline if you use these data!

Gathering Data from ProQuest

  • Create a visualization project
  • Any interesting results?

Gathering Data from ProQuest

  • Go to Workbench
  • Create a dataset!
  • Congressional Hearings
  • Part C
  • Review project, give it a simple name and description

Text Mining Congressional Records

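library(pdftools)
library(dplyr)     # provides %>%
library(tidytext)  # provides unnest_tokens() and stop_words
# read hearing into R; pdf_text() returns one character string per page
levees_hearing <- pdf_text("G:/My Drive/Data_Disasters/Course_site/Data/Katrina_hearings/katrina_hearing_levees.pdf") %>%
  as.data.frame()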

# set column names 
colnames(levees_hearing) <- c("text")
  • What should we do next?

Text Mining Congressional Records

  • Make the data “tidytext”!
  • Take a look at the data
# unnest tokens
tidy_hearing <- levees_hearing %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words, by = "word")
  
# take a look 
head(tidy_hearing)
##        word
## 1       hrg
## 2       109
## 3       526
## 4 hurricane
## 5   katrina
## 6    levees

Text Mining Congressional Records

  • Examine top words
  • What’s going on here?
# look at top words
tidy_hearing_counts <- tidy_hearing %>%
  count(word, sort = TRUE)


# now we can look at the top 20
head(tidy_hearing_counts, 20)
##        word   n
## 1        09 726
## 2      6601 617
## 3      2006 365
## 4        37 365
## 5      2002 364
## 6        31 364
## 7     00000 363
## 8    024446 363
## 9       0ct 363
## 10    24446 363
## 11      aug 363
## 12     docs 363
## 13      fmt 363
## 14      frm 363
## 15      jkt 363
## 16      pat 363
## 17       po 363
## 18      psn 363
## 19 saffairs 363
## 20     sfmt 363

Text Mining Congressional Records

  • Identify additional stop words
library(magrittr)
# vector for additional stop words
addl_stop_words <- tidy_hearing_counts %>%
  filter(n > 300 ) %>%
  select(word)

# include 6633 as well
addl_stop_words %<>%
  bind_rows(data.frame(word = "6633"))

Text Mining Congressional Records

  • Add them to stop word list
# add additional stop words
custom_stop_words <- bind_rows(data.frame(word = addl_stop_words$word,
                                          lexicon = "custom"),
                               stop_words)

# examine new stop words dataset
head(custom_stop_words)
##   word lexicon
## 1   09  custom
## 2 6601  custom
## 3 2006  custom
## 4   37  custom
## 5 2002  custom
## 6   31  custom

Text Mining Congressional Records

  • Remove our custom stop words
# remove custom stop words
tidy_hearing %<>%
  anti_join(custom_stop_words, by = "word")

# new top words
tidy_hearing_counts <- tidy_hearing %>%
  count(word, sort = TRUE)
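
The whole frequency-cutoff approach can be sketched end to end on toy data. The tokens and the threshold of 4 are invented for this example; the slides use n > 300 on the real hearing.

```r
library(dplyr)

# Hypothetical tokens: "jkt" and "fmt" mimic page-footer debris
toy_tokens <- tibble(
  word = c(rep("jkt", 5), rep("fmt", 5), rep("levees", 3), "surge")
)

toy_counts <- toy_tokens %>%
  count(word, sort = TRUE)

# Treat anything above the cutoff as boilerplate (hypothetical threshold)
threshold <- 4
toy_stop <- toy_counts %>%
  filter(n > threshold) %>%
  select(word) %>%
  mutate(lexicon = "custom")

# Remove the custom stop words and recount
toy_clean <- toy_tokens %>%
  anti_join(toy_stop, by = "word")

toy_clean %>%
  count(word, sort = TRUE)
```

A cutoff like this removes anything frequent, so it is worth reading the resulting stop word list before applying it, in case a genuinely important word sits above the threshold.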

Text Mining Congressional Records

  • Check to see if this fixed our problem
# now we can look at the top 20
head(tidy_hearing_counts, 20)
##         word   n
## 1     levees 127
## 2    senator 126
## 3      corps 121
## 4      levee  92
## 5    orleans  91
## 6      slide  86
## 7         dr  80
## 8       seed  76
## 9  hurricane  72
## 10  chairman  62
## 11     water  62
## 12 engineers  60
## 13       van  57
## 14   heerden  56
## 15     level  56
## 16     canal  55
## 17     storm  54
## 18     surge  48
## 19 lieberman  47
## 20 nicholson  44

Text Mining Congressional Records

  • When combining multiple documents
  • Process overall is very similar
  • bind_rows() and group_by() might be useful
  • What are these doing?
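
One way these two functions might fit together is sketched below on two made-up "hearings". The text, the hearing labels, and the object names are hypothetical.

```r
library(dplyr)
library(tidytext)

# Hypothetical documents, each tagged with an identifying column
hearing_a <- tibble(hearing = "levees",
                    text = "The levees failed. The levees were overtopped.")
hearing_b <- tibble(hearing = "response",
                    text = "The response was slow. Evacuation was uneven.")

# bind_rows() stacks the documents into one table;
# the hearing column records where each row came from
tidy_hearings <- bind_rows(hearing_a, hearing_b) %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words, by = "word")

# group_by() makes later steps run separately within each hearing,
# here picking the single most frequent word per document
top_words <- tidy_hearings %>%
  count(hearing, word) %>%
  group_by(hearing) %>%
  slice_max(n, n = 1, with_ties = FALSE) %>%
  ungroup()

top_words
```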

Text Mining Congressional Records

  • Can combine quantitative text analysis with selective reading!
  • Lieberman: “if the levees had done what they were designed to do, a lot of the flooding of New Orleans would not have occurred, and a lot of the suffering that occurred as a result of the flooding would not have occurred,” (confirmed by Dr. Seed, who was leading the National Science Foundation’s investigation of the levees).
  • “slide” used as in “next slide, please”

Gathering Data from ProQuest

  • Getting started with Jupyter Notebooks
  • Open Jupyter Notebook
  • Look at dataset, copy name
  • Go to Getting Started R -> 2022.05.13 -> ProQuest TDM Studio R Samples
  • Open R Convert to Dataframe
  • Make one change: replace “SAMPLEDATA” with the name of your dataset
  • Run the chunks one by one

Gathering Data from ProQuest

  • Now, within ProQuest TDM Studio R Samples
  • Create a new Jupyter Notebook!

Gathering Data from ProQuest

  • To use packages like tidytext in ProQuest
  • Begin a new Jupyter Notebook by following the instructions in Getting Started/2022.05.25/ProQuest TDM Studio Manuals/TDM_Studio_Manual.ipynb.

Problem Set 3

  • Similar to Problem Sets 1 and 2
  • Update: since ProQuest is so glitchy, I will be very lenient for Questions 2 and 3
  • Please do not waste hours on this website if it is not of interest!

A note on markdown

  • Example file added to Canvas!

Final Project Proposal

  • Teams of up to 4
  • Still time to change after proposal
  • Include research question, motivation, data, and methods
  • The more details, the better

Mid-Course Feedback