Analytics for a Changing Climate: Week 3

2023-07-10

Class Plan

Data activity (10 min)
Intro to text mining (40 min)
Break (5 min)
Discuss Katrina, class readings (15 min)
Intro to ProQuest TDM Studio (20 min)
Problem set questions (Remainder)

Week 3 Groups!

print.data.frame(groups)

##                  group 1         group 2                 group 3
## 1 Cortez, Hugo Alexander   Cai, Qingyuan          Somyurek, Ecem
## 2  Widodo, Ignazio Marco    Gupta, Umang      Jun, Ernest Ng Wei
## 3  Leong, Wen Hou Lester Knutson, Blue C Spindler, Laine Addison
## 4        Gnanam, Akash Y Tan, Zheng Yang     Premkrishna, Shrish
##                            group 4       group 5              group 6
## 1        Saccone, Alexander Connor                Albertini, Federico
## 2 Ramos, Jessica Andria Potestades  Ng, Michelle         Shah, Jainam
## 3                                  Ning, Zhi Yan Dotson, Bianca Ciara
## 4          Alsayegh, Aisha E H M I     Su, Barry        Lim, Fang Jan
##                    group 7
## 1              Tian, Zerui
## 2         Wan Rosli, Nadia
## 3 Huynh Le Hue Tam, Vivian
## 4      Andrew Yu Ming Xin,

Data Activity

Task: communicate a message about the data using a visual approach of your choosing.
Be creative! Think about using the numeric and text information in different ways. Consider bringing in outside resources, such as maps.
Visualization does not need to be complete!

Data Visualization

Book: https://socviz.co/

Hurricane Charts

To Follow Along with Slides

https://rpubs.com/tylermcdaniel/soc128d_week3

What is TidyText?

What is Tidy Data?

Each observation is a row
Each variable a column
Each type of observational unit a table

What is Tidy Data?

Row	Person	Birthday	Occupation
1	Joe	12/3/1963	Carpenter
2	Malik	6/8/1978	Architect
3	Suzanna	4/3/2001	Student

What is Tidy Data?

Row	County	Temperature	PM2.5
1	Santa Clara	78.1	12.1
2	San Mateo	82.3	32.1
3	San Francisco	65.4	44.7

What is TidyText Data?

How should we organize data with text?
For example, newspaper articles

What is TidyText Data?

Row	Paper	Article	Text
1	New York Times	Study Compares Gas Stove Pollu…	Using
2	New York Times	Study Compares Gas Stove Pollu…	a
3	New York Times	Study Compares Gas Stove Pollu…	single

What is TidyText Data?

Row	Paper	Article	Text
1	New York Times	Study Compares Gas Stove Pollution to Secondhand Cigarette Smoke	Using a single gas-stove burner can raise indoor concentrations of benzene, …
2	New York Times	Study Compares Gas Stove Pollution to Secondhand Cigarette Smoke	For the peer-reviewed study, researchers at Stanford’s Doerr School of Sustainability …
3	New York Times	Study Compares Gas Stove Pollution to Secondhand Cigarette Smoke	In about a third of the homes, a single gas burner …
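
The move from one row per paragraph to one row per word can be sketched on a toy table. The `paragraphs` data frame below is a hypothetical stand-in with abbreviated text, not the course data; `unnest_tokens()` comes from the tidytext package.

```r
library(dplyr)
library(tidytext)

# Hypothetical one-row-per-paragraph table (text abbreviated for illustration)
paragraphs <- data.frame(
  paper = "New York Times",
  paragraph = 1:2,
  text = c("Using a single gas-stove burner can raise indoor concentrations of benzene",
           "In about a third of the homes, a single gas burner")
)

# One row per word: unnest_tokens() lowercases the text, strips punctuation,
# and replaces the text column with a word column
words <- paragraphs %>%
  unnest_tokens(word, text)

head(words, 3)
```

The other columns (`paper`, `paragraph`) are repeated on every word row, so each word can still be traced back to its source.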

What is TidyText Data?

What about the Guardian news data that we pulled earlier - is this in tidy text format?
Read it into your R environment, and if it is not, try to convert it into a tidy text table.

library(readr)
ca_wf <- read_csv("ca_wf.csv")

Refresher on Pipes

What does %>% do?
e.g. what would ca_wf %>% mutate(new_var = "") do?
What about %<>%?
Discuss in groups
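
As a starting point for the discussion, here is a minimal sketch of the two pipes. The `toy` data frame is a hypothetical stand-in for `ca_wf`. Note that `%<>%` comes from the magrittr package, which has to be loaded separately from dplyr.

```r
library(dplyr)
library(magrittr)  # provides %<>%; dplyr only re-exports %>%

# Hypothetical toy data standing in for ca_wf
toy <- data.frame(type = c("liveblog", "article"), id = 1:2)

# %>% passes the left-hand side as the first argument of the function on the right.
# The result is returned, but toy itself is unchanged.
with_new_var <- toy %>% mutate(new_var = "")
ncol(toy)           # still 2 columns
ncol(with_new_var)  # 3 columns

# %<>% pipes and then assigns the result back to the left-hand side
toy %<>% mutate(new_var = "")
ncol(toy)           # now 3 columns
```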

Working with TidyText Data

Let’s first limit our dataset to blogs using filter()
Are these data in tidytext format?

library(tidytext)
library(dplyr)
library(magrittr) # needed for %<>% on the following slides

# first, set up liveblog dataframe
tidy_blogs <- ca_wf %>%
  filter(type == "liveblog")

Working with TidyText Data

Let’s say we want each row to be a word

# unnest tokens
tidy_blogs %<>%
  unnest_tokens(word, body_text) %>%
  anti_join(stop_words)

Working with TidyText Data

A lot is going on here!
What is unnest_tokens() doing?
What about the anti_join(stop_words)?

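# unnest tokens (same code as the previous slide; run it only once,
# because unnest_tokens() drops the body_text column)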
tidy_blogs %<>%
  unnest_tokens(word, body_text) %>%
  anti_join(stop_words)

Working with TidyText Data

What are stop_words? You can run View(stop_words) to look at these
Three different lexicons: SMART, snowball, and onix
Stop words are essentially words that are not useful for our analyses, such as “the”
Are there any surprising words there?

Working with TidyText Data

What happens when we anti_join the stop_words?
Let’s take a closer look at joins
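
A minimal sketch of the idea, using a tiny made-up stop word list (`toy_stop_words` is hypothetical, not the tidytext `stop_words` table):

```r
library(dplyr)

# Hypothetical tokens and a tiny stand-in stop word list
tokens <- data.frame(word = c("the", "smoke", "of", "wildfires"))
toy_stop_words <- data.frame(word = c("the", "of", "a"))

# inner_join() keeps the rows of tokens that have a match in toy_stop_words
matched <- inner_join(tokens, toy_stop_words, by = "word")
matched$word   # "the" "of"

# anti_join() keeps the rows of tokens that have NO match in toy_stop_words
kept <- anti_join(tokens, toy_stop_words, by = "word")
kept$word      # "smoke" "wildfires"
```

Passing `by = "word"` makes the join column explicit and avoids the "Joining with" message.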

Working with TidyText Data

Let’s look at the result

# look at examples
tidy_blogs %>%
  select(type, word) %>%
  head()

## # A tibble: 6 × 2
##   type     word    
##   <chr>    <chr>   
## 1 liveblog 6pm     
## 2 liveblog york    
## 3 liveblog city    
## 4 liveblog skies   
## 5 liveblog shrouded
## 6 liveblog thick

Working with TidyText Data

Now that we have a tokenized dataset, new analyses become simple
For example, we can use the count() function to get word frequencies

# look at blog word frequencies
tidy_blogs %>%
  count(word, sort = TRUE)

## # A tibble: 2,379 × 2
##    word          n
##    <chr>     <int>
##  1 air         111
##  2 quality      69
##  3 smoke        68
##  4 wildfires    64
##  5 trump        59
##  6 pence        58
##  7 york         58
##  8 canada       55
##  9 president    55
## 10 city         46
## # ℹ 2,369 more rows

Working with TidyText Data

What if we repeat the prior steps for articles, not blogs?

# look at article frequencies
tidy_articles %>%
  count(word, sort = TRUE)

## # A tibble: 1,808 × 2
##    word          n
##    <chr>     <int>
##  1 air          96
##  2 smoke        58
##  3 quality      50
##  4 york         40
##  5 canada       37
##  6 wildfires    34
##  7 climate      33
##  8 fires        32
##  9 city         31
## 10 wednesday    30
## # ℹ 1,798 more rows
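
The slides do not show how `tidy_articles` was built. One way to repeat the prior steps is sketched below, under two assumptions: the helper `tokenize_type()` is hypothetical, and the label `"article"` is a guess at how articles are coded in the `type` column (check with `count(ca_wf, type)`). The toy data frame stands in for `ca_wf`.

```r
library(dplyr)
library(tidytext)

# Hypothetical helper: keep one content type, then tokenize and drop stop words
tokenize_type <- function(data, type_value) {
  data %>%
    filter(type == type_value) %>%
    unnest_tokens(word, body_text) %>%
    anti_join(stop_words, by = "word")
}

# Toy stand-in for ca_wf; the real data and its type labels may differ
toy_ca_wf <- data.frame(
  type = c("liveblog", "article"),
  body_text = c("Skies over the city were shrouded in smoke",
                "Smoke from the wildfires lowered air quality")
)

# "article" is an assumed label, not confirmed by the source
toy_articles <- tokenize_type(toy_ca_wf, "article")
count(toy_articles, word, sort = TRUE)
```

With the real data, the equivalent call would be `tidy_articles <- tokenize_type(ca_wf, "article")`, again assuming that label is correct.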

Working with TidyText Data

library(tidyr)
frequency <- bind_rows(tidy_blogs,
                       tidy_articles) %>% 
  count(type, word) %>%
  group_by(type) %>%
  mutate(proportion = n / sum(n)) %>% 
  select(-n) %>% 
  pivot_wider(names_from = type, values_from = proportion)
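
To see what this pipeline produces, here is the same sequence of steps on a made-up five-row token table (`toy_tokens` is hypothetical). An `ungroup()` is added before reshaping as a precaution.

```r
library(dplyr)
library(tidyr)

# Hypothetical tokenized data: one row per word, labelled by type
toy_tokens <- data.frame(
  type = c("liveblog", "liveblog", "liveblog", "article", "article"),
  word = c("smoke", "smoke", "trump", "smoke", "climate")
)

toy_frequency <- toy_tokens %>%
  count(type, word) %>%                 # n = times each word appears per type
  group_by(type) %>%
  mutate(proportion = n / sum(n)) %>%   # share of that type's words
  ungroup() %>%
  select(-n) %>%
  pivot_wider(names_from = type, values_from = proportion)

toy_frequency
```

The result has one row per word and one column per type. A word that never appears in a type gets `NA` in that column, which is worth remembering when comparing blogs and articles.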

Hurricane Katrina

Costliest storm in U.S. history
An “unnatural disaster”

Hurricane Katrina

In groups, discuss:
What occurred during Hurricane Katrina, in terms of the natural disaster?
What elements were “unnatural”?

Hurricane Katrina

Failure of levees and city infrastructure

Hurricane Katrina

Failure of governmental and social responses

Hurricane Katrina

Uneven evacuation and return-migration

Hurricane Katrina

Ongoing disruption for those who were displaced

Class Plan

Data activity (10 min)
Intro to ProQuest TDM Studio (10 min)
Text mining congressional records (40 min)
Break (5 min)
Using ProQuest TDM Studio (20 min)
Problem set questions (Remainder)

Data Activity

Task: communicate a message about the data using a visual approach of your choosing.
Be creative! Think about using the numeric and text information in different ways. Consider bringing in outside resources, such as maps.

Hurricane Charts

Vermont Flooding

Gathering Data from ProQuest

What is ProQuest?
Repository of newspapers, academic articles, dissertations, government records, and more

Gathering Data from ProQuest

Pro: You can analyze data in the cloud. This is great when working with large datasets, as it frees up your local machine for other work.
Con: Slow and finicky. This could change, but as of now the ProQuest TDM system is not especially fast and may freeze or crash on you.
Con: Time required to export results. You can’t simply save your outputs from ProQuest TDM to your desktop and use them; they must first be approved by the ProQuest team. This usually takes less than an hour, but it could take longer, so budget for it in your project timeline if you use these data!

Gathering Data from ProQuest

Create an account on TDM Studio!
https://tdmstudio.proquest.com/home

Gathering Data from ProQuest

Create a visualization project
Any interesting results?

Gathering Data from ProQuest

Go to Workbench
Create a dataset!
Congressional Hearings
Part C
Review project, give it a simple name and description

Text Mining Congressional Records

Can download records here:
https://www.govinfo.gov/app/collection/chrg/

Text Mining Congressional Records

library(pdftools)
# read hearing into R (pdf_text() returns one character string per page)
levees_hearing <- pdf_text("G:/My Drive/Data_Disasters/Course_site/Data/Katrina_hearings/katrina_hearing_levees.pdf") %>%
  as.data.frame()

# set column names 
colnames(levees_hearing) <- c("text")

What should we do next?

Text Mining Congressional Records

Make the data “tidytext”!
Take a look at the data

# unnest tokens
tidy_hearing <- levees_hearing %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words)
  
# take a look 
head(tidy_hearing)

##        word
## 1       hrg
## 2       109
## 3       526
## 4 hurricane
## 5   katrina
## 6    levees

Text Mining Congressional Records

Examine top words
What’s going on here?

# look at top words
tidy_hearing_counts <- tidy_hearing %>%
  count(word, sort = TRUE)


# now we can look at the top 20
head(tidy_hearing_counts, 20)

##        word   n
## 1        09 726
## 2      6601 617
## 3      2006 365
## 4        37 365
## 5      2002 364
## 6        31 364
## 7     00000 363
## 8    024446 363
## 9       0ct 363
## 10    24446 363
## 11      aug 363
## 12     docs 363
## 13      fmt 363
## 14      frm 363
## 15      jkt 363
## 16      pat 363
## 17       po 363
## 18      psn 363
## 19 saffairs 363
## 20     sfmt 363

Text Mining Congressional Records

Identify additional stop words

library(magrittr)
# vector for additional stop words
addl_stop_words <- tidy_hearing_counts %>%
  filter(n > 300 ) %>%
  select(word)

# include 6633 as well
addl_stop_words %<>%
  bind_rows(data.frame(word = "6633"))

Text Mining Congressional Records

Add them to stop word list

# add additional stop words
custom_stop_words <- bind_rows(data.frame(word = addl_stop_words$word,
                                          lexicon = "custom"),
                               stop_words)

# examine new stop words dataset
head(custom_stop_words)

##   word lexicon
## 1   09  custom
## 2 6601  custom
## 3 2006  custom
## 4   37  custom
## 5 2002  custom
## 6   31  custom

Text Mining Congressional Records

Remove our custom stop words

# remove custom stop words
tidy_hearing %<>%
  anti_join(custom_stop_words)

# new top words
tidy_hearing_counts <- tidy_hearing %>%
  count(word, sort = TRUE)

Text Mining Congressional Records

Check to see if this fixed our problem

# now we can look at the top 20
head(tidy_hearing_counts, 20)

##         word   n
## 1     levees 127
## 2    senator 126
## 3      corps 121
## 4      levee  92
## 5    orleans  91
## 6      slide  86
## 7         dr  80
## 8       seed  76
## 9  hurricane  72
## 10  chairman  62
## 11     water  62
## 12 engineers  60
## 13       van  57
## 14   heerden  56
## 15     level  56
## 16     canal  55
## 17     storm  54
## 18     surge  48
## 19 lieberman  47
## 20 nicholson  44

Text Mining Congressional Records

When combining multiple documents
Process overall is very similar
bind_rows() and group_by() might be useful
What are these doing?
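
A minimal sketch of the multi-document workflow, assuming two hearings have already been read in. The data frames `hearing_a` and `hearing_b` and their text are made up for illustration.

```r
library(dplyr)
library(tidytext)

# Hypothetical stand-ins for two hearings read in with pdf_text()
hearing_a <- data.frame(text = "The levees failed during the storm surge")
hearing_b <- data.frame(text = "Evacuation plans failed many residents")

# bind_rows() stacks the documents; .id records which document each row came from
all_hearings <- bind_rows(levees = hearing_a, evacuation = hearing_b,
                          .id = "hearing")

# group_by() makes the counts and shares specific to each hearing
word_shares <- all_hearings %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words, by = "word") %>%
  group_by(hearing) %>%
  count(word, sort = TRUE) %>%
  mutate(share = n / sum(n)) %>%
  ungroup()

word_shares
```

Because the `hearing` column travels with every token, the same table supports both pooled counts (drop the `group_by()`) and per-document comparisons.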

Text Mining Congressional Records

Can combine quantitative text analysis with selective reading!
Lieberman: “if the levees had done what they were designed to do, a lot of the flooding of New Orleans would not have occurred, and a lot of the suffering that occurred as a result of the flooding would not have occurred,” (confirmed by Dr. Seed, who was leading the National Science Foundation’s investigation of the levees).
“slide” used as in “next slide, please”

Gathering Data from ProQuest

Getting started with Jupyter Notebooks
Open Jupyter Notebook
Look at dataset, copy name
Go to Getting Started R -> 2022.05.13 -> ProQuest TDM Studio R Samples
Open R Convert to Dataframe
Make one change: replace “SAMPLEDATA” with the name of your dataset
Run the chunks one by one

Gathering Data from ProQuest

Now, within ProQuest TDM Studio R Samples
Create a new Jupyter Notebook!

Gathering Data from ProQuest

To use packages like tidytext in ProQuest
Begin a new Jupyter Notebook by following the instructions in Getting Started/2022.05.25/ProQuest TDM Studio Manuals/TDM_Studio_Manual.ipynb.

Problem Set 3

Similar to Problem Sets 1 and 2
Update: since ProQuest is so glitchy, I will be very lenient for Questions 2 and 3
Please do not waste hours on this website if it is not of interest!

A note on markdown

Example file added to Canvas!

Final Project Proposal

Teams of up to 4
Still time to change after proposal
Include research question, motivation, data, and methods
The more details, the better

Mid-Course Feedback

https://forms.gle/muNNUdT6NFyfA3r8A