class: center, middle, inverse, title-slide

.title[
# The social construction of ‘Health’
]
.subtitle[
## A data exploration using Twitter’s API
]
.author[
### Victoria Sass
]
.institute[
### UW Sociology
]
.date[
### 10 January 2023
]

---

# Using APIs

In the previous example we directly scraped text from one webpage of interest. This can be useful if you have a few specific pages of interest, but if you're in a more exploratory, discovery phase of your research process (as I am), you may want to collect a larger trove of data using a more systematic approach. Fortunately, some websites provide a way to do this through an Application Programming Interface (API).

---

# My research

A bit about my research interests: I'm a PhD student in the Sociology department and my dissertation broadly focuses on the social construction of health and health problems. I'm specifically interested in the social construction of "obesity" and the interactive mental and physical health effects of an overwhelming medical/public health/cultural prescription to diet.

--

Part of my work uses existing health datasets to understand the relationship between dieting and mental/physical health outcomes. But I'm also interested in the ways people talk about health. This latter aspect of my work is still being fleshed out, so as a starting point I'd like to explore a corpus of tweets that contain certain words of interest.

---

# Twitter's API

To use Twitter's API you need to sign up for an account (or use an existing account) and apply for "developer" access. You can get instant access by applying for an **Essential** account (limited access), which enables you to retrieve up to 500,000 tweets per month. You can also apply for an **Academic Research** account, which involves a more thorough application process (about your specific research aims and usage of the data) but allows you to retrieve up to 10,000,000 tweets per month with full access to the archive of historical tweets. I'm currently waiting on approval for the latter, so we'll proceed here with a more basic method of authentication.

--

More information about developer accounts can be found [here](https://developer.twitter.com/en/docs/twitter-api/getting-started/about-twitter-api).

---

# Authentication

First we want to load the necessary packages:

```r
library(tidyverse)  # suite of useful data manipulation and tidying packages
library(tidytext)   # package to create tidy text data
library(rtweet)     # package to pull Twitter data using their API
library(wordcloud2) # easy wordcloud visualizations
library(ggthemes)   # nice themes for ggplot
```

--

Once approved for a developer account you will create an app and be given four keys which serve to authenticate your requests. You'll want to store these outside of your R script for privacy reasons, especially if you intend to share your code with others.

For this example we're simply going to use a basic form of authentication, which lets you use the Twitter account logged in on your default browser. The rate limit for this type of authentication is 18,000 tweets every 15 minutes, so it's considerably more limited than authentication using an app or a bot.

More info on authenticating with `rtweet` can be found [here](https://cran.r-project.org/web/packages/rtweet/vignettes/auth.html) or by running `vignette('auth')` in the console.
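--

Once the keys arrive, app- or bot-based authentication might look something like the sketch below. This is a hypothetical illustration: the environment-variable names and the saved-token name are placeholders of my own, assuming the four keys have been added to `.Renviron`.

```r
# Hypothetical sketch: build a bot token from keys stored in .Renviron
# (the variable names below are my placeholders, not rtweet defaults)
auth <- rtweet_bot(
  api_key       = Sys.getenv("TWITTER_API_KEY"),
  api_secret    = Sys.getenv("TWITTER_API_SECRET"),
  access_token  = Sys.getenv("TWITTER_ACCESS_TOKEN"),
  access_secret = Sys.getenv("TWITTER_ACCESS_SECRET")
)
auth_save(auth, "my_app") # save once per machine...
auth_as("my_app")         # ...then load it in any future session
```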
---

# Authentication

The following line of code only needs to be run once per computer:

```r
auth_setup_default()
```

---

# Getting tweets

Now we can pull tweets of interest. For my work I want to start by seeing what kinds of words accompany three general themes: health, dieting, and wellness.

```r
health <- search_tweets(q = "health", n = 10000, include_rts = FALSE, lang = "en")

dieting <- search_tweets(q = "diet OR dieting OR lose weight", n = 10000, include_rts = FALSE, lang = "en")

wellness <- search_tweets(q = "wellness", n = 10000, include_rts = FALSE, lang = "en")
```

---

# Getting tweets

A preview of one of these datasets gives you a sense of what's available:

```r
names(wellness)
```

```
##  [1] "created_at"                    "id"                           
##  [3] "id_str"                        "full_text"                    
##  [5] "truncated"                     "display_text_range"           
##  [7] "entities"                      "metadata"                     
##  [9] "source"                        "in_reply_to_status_id"        
## [11] "in_reply_to_status_id_str"     "in_reply_to_user_id"          
## [13] "in_reply_to_user_id_str"       "in_reply_to_screen_name"      
## [15] "geo"                           "coordinates"                  
## [17] "place"                         "contributors"                 
## [19] "is_quote_status"               "retweet_count"                
## [21] "favorite_count"                "favorited"                    
## [23] "retweeted"                     "lang"                         
## [25] "possibly_sensitive"            "quoted_status_id"             
## [27] "quoted_status_id_str"          "quoted_status"                
## [29] "text"                          "favorited_by"                 
## [31] "scopes"                        "display_text_width"           
## [33] "retweeted_status"              "quoted_status_permalink"      
## [35] "quote_count"                   "timestamp_ms"                 
## [37] "reply_count"                   "filter_level"                 
## [39] "query"                         "withheld_scope"               
## [41] "withheld_copyright"            "withheld_in_countries"        
## [43] "possibly_sensitive_appealable"
```

---

# Getting tweets

A preview of one of these datasets gives you a sense of what's available:

```r
wellness$full_text[1]
```

```
## Prioritizing health and wellness in 2023.
```

```r
wellness$full_text[2]
```

```
## In 2022, US cops killed more people than any year on record.
## 
## • 32% were fleeing
## • 11% committed no offense
## • 9% during mental health / wellness checks
## • 8% involved traffic infractions
## • 18% for non-violent offenses
## 
## Cops killed +3 everyday for mostly arbitrary reasons. https://t.co/J0HwJYvvVO
```

```r
wellness$full_text[3]
```

```
## Healthcare worker and admin at “wellness” seminar https://t.co/mUoca280Ki
```
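---

# Getting tweets

Because search results change constantly and the basic authentication caps out at 18,000 tweets per 15 minutes, it can be worth caching each pull. A sketch of what that might look like (`retryonratelimit` is an argument to `search_tweets()`; the file name is my own placeholder):

```r
# Sketch: for pulls larger than one rate-limit window, retryonratelimit
# waits out the limit and resumes; saving the result lets the analysis
# be re-run later without re-querying the API
health <- search_tweets(q = "health", n = 25000, include_rts = FALSE,
                        lang = "en", retryonratelimit = TRUE)
saveRDS(health, "health_tweets.rds") # placeholder file name
```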
---

# Tokenizing

Now that we have pulled tweets from three topics of interest we can tokenize them and remove stop words.

```r
tidy_health   <- tibble(tweet = seq_along(health$full_text),   text = health$full_text)
tidy_dieting  <- tibble(tweet = seq_along(dieting$full_text),  text = dieting$full_text)
tidy_wellness <- tibble(tweet = seq_along(wellness$full_text), text = wellness$full_text)

data("stop_words") # tidytext's stop word lexicon

tidy_health2 <- tidy_health %>% 
  unnest_tokens(word, text) %>%   # tokenizing by word
  anti_join(stop_words) %>%       # removing stop words
  filter(word != "health") %>%    # removing the key word itself
  filter(!(word %in% c("t.co", "https", "amp", "it’s", "i’m", "don’t"))) %>% # removing stop words specific to tweets
  filter(grepl("[^0-9]", word))   # dropping all-digit tokens via regex
```

---

# Tidy text

Now we have our data in tidy text format:

```r
tidy_health2 %>%
  count(word, sort = TRUE)
```

```
## # A tibble: 33,214 × 2
##    word        n
##    <chr>   <int>
##  1 mental   2177
##  2 care     1174
##  3 people    864
##  4 public    502
##  5 time      445
##  6 covid     442
##  7 issues    402
##  8 life      399
##  9 system    330
## 10 medical   286
## # … with 33,204 more rows
```
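---

# Tokenizing

The slides that follow use `tidy_dieting2` and `tidy_wellness2`, built with the same cleaning steps. One way to avoid repeating the pipeline is a small helper; this is a sketch of my own (the function name and the per-corpus keyword filters are choices I'm making here, not code shown above):

```r
# Hypothetical helper applying the same cleaning to any tweet corpus;
# `keywords` holds the query terms to drop from that corpus
tidy_tweets <- function(df, keywords) {
  tibble(tweet = seq_along(df$full_text), text = df$full_text) %>% 
    unnest_tokens(word, text) %>%            # tokenize by word
    anti_join(stop_words, by = "word") %>%   # drop standard stop words
    filter(!(word %in% keywords)) %>%        # drop the query terms themselves
    filter(!(word %in% c("t.co", "https", "amp", "it’s", "i’m", "don’t"))) %>% # tweet-specific stop words
    filter(grepl("[^0-9]", word))            # drop all-digit tokens
}

tidy_dieting2  <- tidy_tweets(dieting,  c("diet", "dieting", "weight"))
tidy_wellness2 <- tidy_tweets(wellness, "wellness")
```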
---

# Visualize word frequencies

Having tidied the other two datasets the same way, we can start to create some visualizations.

```r
# code for wordcloud
tidy_health2 %>%
  count(word) %>%
  filter(n > 50) %>%
  wordcloud2()

# code for bar chart
tidy_health2 %>%
  count(word, sort = TRUE) %>%   # calculating frequencies and sorting in descending order (default)
  filter(n > 200) %>%            # keeping only words mentioned more than 200 times
  mutate(word = reorder(word, n),          # reordering the words in the tibble by their frequency
         color = case_when(n >= 500 ~ "1", # creating color categories for different frequencies
                           n >= 400 ~ "2",
                           n >= 300 ~ "3",
                           n <  300 ~ "4",
                           TRUE ~ NA_character_)) %>%
  ggplot(aes(n, word)) +         # main plot call
  geom_col(aes(fill = color)) +  # geom for columns
  scale_fill_manual(values = c("1" = "#4e79a7", "2" = "#f28e2c", "3" = "#e15759", "4" = "#76b7b2")) + # manual colors
  geom_text(aes(label = n), nudge_x = 2, color = "black") + # adding frequency labels
  labs(y = NULL) +               # removing y axis name
  theme_tufte() +                # changing theme to tufte (minimal aesthetic)
  theme(legend.position = "none") # removing color legend
```

---

# Health wordcloud

---

# Health bar chart

---

# Dataset comparisons

```r
# code to make frequency comparisons
frequency <- bind_rows(mutate(tidy_dieting2, keyword = "Dieting"), # binding rows together
                       mutate(tidy_wellness2, keyword = "Wellness"),
                       mutate(tidy_health2, keyword = "Health")) %>% 
  mutate(word = str_extract(word, "[a-z']+")) %>% # keeping only letters and apostrophes, dropping punctuation
  count(keyword, word) %>%
  filter(n > 20) %>%
  group_by(keyword) %>%               # grouping dataset by the three keywords
  mutate(proportion = n / sum(n)) %>% # proportion of each word within each keyword-specific corpus
  select(-n) %>%                      # removing n from dataset
  pivot_wider(names_from = keyword, values_from = proportion) %>% # creating a column for each keyword w/ proportion as the value
  pivot_longer(c(`Dieting`, `Wellness`), names_to = "keyword", values_to = "proportion") # keeping the Health column and collapsing Dieting and Wellness back into a keyword column so each row compares a proportion to Health
```

---

# Dataset comparisons

```r
frequency
```

```
## # A tibble: 3,646 × 4
##    word          Health keyword  proportion
##    <chr>          <dbl> <chr>         <dbl>
##  1 a           0.000905 Dieting    0.00213 
##  2 a           0.000905 Wellness   0.00133 
##  3 absolutely  0.000815 Dieting    0.000477
##  4 absolutely  0.000815 Wellness  NA       
##  5 abt        NA        Dieting    0.000562
##  6 abt        NA        Wellness  NA       
##  7 accept     NA        Dieting    0.000375
##  8 accept     NA        Wellness  NA       
##  9 achieve    NA        Dieting    0.00102 
## 10 achieve    NA        Wellness   0.00158 
## # … with 3,636 more rows
```

---

# Dataset comparisons

```r
# code to make comparison plot
library(scales)

ggplot(frequency, aes(x = proportion, y = `Health`,
                      color = abs(`Health` - proportion))) +
  geom_abline(color = "gray40", lty = 2) +
  geom_jitter(alpha = 0.1, size = 2.5, width = 0.3, height = 0.3) +
  geom_text(aes(label = word), check_overlap = TRUE, vjust = 1.5) +
  scale_x_log10(labels = percent_format()) +
  scale_y_log10(labels = percent_format()) +
  scale_color_gradient(limits = c(0, 0.001), low = "darkslategray4", high = "gray75") +
  facet_wrap(~keyword, ncol = 2) +
  labs(y = "Health", x = NULL) +
  theme_tufte() +
  theme(legend.position = "none")
```

---

# Dataset comparisons

---

# Correlation of relative frequencies

```r
# Correlation between Health and Dieting tweets
cor.test(data = frequency[frequency$keyword == "Dieting",],
         ~ proportion + `Health`)
```

```
## 
##  Pearson's product-moment correlation
## 
## data:  proportion and Health
## t = 4.588, df = 505, p-value = 5.652e-06
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.1149573 0.2822040
## sample estimates:
##       cor 
## 0.2000374
```

---

# Correlation of relative frequencies

```r
# Correlation between Health and Wellness tweets
cor.test(data = frequency[frequency$keyword == "Wellness",],
         ~ proportion + `Health`)
```

```
## 
##  Pearson's product-moment correlation
## 
## data:  proportion and Health
## t = 12.171, df = 609, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.3761925 0.5039532
## sample estimates:
##       cor 
## 0.4423141
```

--

The words in tweets referencing health are more strongly correlated with wellness-related tweets (0.442) than with dieting-related tweets (0.200), though neither correlation is particularly strong.

---

# Discussion

Principles of Selection and Representation

- Principle 1: Question-Specific Corpus Construction

  Using tweets naturally comes with limitations, as not everyone writes down their thoughts and feelings about health-related topics, much less uses Twitter to do so. Nonetheless, Twitter data can still be of use for gaining a better and broader understanding of the opinions of people who do tweet about these topics.

--

- Principle 2: No Values-Free Corpus Construction

  In addition to the above limitations, the type of speech that occurs around topics such as "obesity," body size, and the morality associated with health status warrants additional sensitivity. Hate speech is rife on Twitter, particularly when it comes to anti-fat bias, and I'll want to be careful going forward about how I contextualize my analysis to avoid perpetuating harmful speech.

---

# Discussion

- Principle 3: No Right Way to Represent Text

  This exploration of a limited amount of data was a good first step, but I'll definitely want to analyze many additional facets of tweets of interest, e.g. sentiment analysis or topic modeling (a rough sketch of the former follows this slide). With more exploration I hope to have a better sense of the research questions I'm interested in and, therefore, which methods will be most appropriate.

--

- Principle 4: Validation

  Of course, extensive validation will be part of whatever method I choose. Comparing methods across different types of data sources may also be interesting, e.g. how are these topics discussed on Twitter versus Reddit versus Instagram?
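---

# A first pass at sentiment

To make Principle 3 concrete, here is a rough sketch of what a first sentiment pass might look like, using tidytext's built-in lexicons with the tidied data from earlier. This is my own illustration, not part of the analysis above, and the choice of the Bing lexicon is arbitrary.

```r
# Sketch: tag each remaining word as positive/negative with the
# Bing lexicon, then tally sentiment across the health corpus
tidy_health2 %>% 
  inner_join(get_sentiments("bing"), by = "word") %>% # keep only words in the lexicon
  count(sentiment, sort = TRUE)
```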
---

# Discussion

Selecting Documents

When my developer account for research purposes is approved, I expect to recreate and expand this exploration to include more texts across a longer range of time, as well as more keywords related to the topics I'm interested in.

--

- Resource Bias

  Again, Twitter is not representative of the entire population of English speakers, and therefore my analysis will be limited in its generalizability. However, I do think much can be learned from the way a broader swath of the population (i.e. not just medical doctors, public health officials, and people directly involved in the health/wellness industrial complex) discusses these topics.

--

- Incentive Bias

  I'll want to consider that not all tweets have been retained (some have been deleted by the users themselves, others have been removed due to violations of Twitter's use agreement). Also, some people may not want to share their real opinions about the topics I'm interested in for fear of being perceived a certain way.

---

# Discussion

- Medium Bias

  I will definitely want to engage with the full text of the tweets themselves to gain a better understanding of their context. This will help inform the types of analysis I consider doing and even the questions I am interested in answering. Other forms of media (pictures, links, emojis) may provide additional information that informs those decisions.

  Additionally (and as a counter to social desirability bias), there is a tendency for people to express more negative opinions in online spaces where they are, or feel, anonymous. I'll want to think critically about what this means for the social phenomena I'm expecting to witness versus what I actually capture using this platform as a data source.

---

# Discussion

- Retrieval Bias

  Coming up with a list of keywords has an inherent bias, as it is subject to my understanding of the social processes I'm interested in and the relationships I think might hold between them. Validation and iteration throughout the research process, particularly in the discovery phase, will thus be very important. Additionally, I'll want to gain a more thorough understanding of the Twitter API to make sure I understand its limitations and how they may affect the types of tweets I have access to.