Executive Summary

The exploratory analysis of the texts written by Zitkala-Sa showed differences between Old Indian Legends and American Indian Stories. Word frequencies in each book showed that Old Indian Legends primarily uses words associated with nature, while American Indian Stories uses words associated with family and day-to-day life in America. The unsupervised topic model identified two distinct topics: Old Indian Legends contributes to Topic 1 and American Indian Stories contributes to Topic 2. The words in each topic are consistent with the initial word frequency calculations, with Topic 1 consisting of words associated with nature and Topic 2 of words associated with family and humans. Zitkala-Sa's texts could share themes with those of other Native American writers of her time. We recommend comparing her books with works by her contemporaries and with more recent works by Native American authors, so that themes from the 1900s can be compared with those from the 2000s. This follow-on analysis would show whether themes have changed over time or remained the same.

Background and Objectives

Zitkala-Sa was a Dakota activist and author born in 1876. She and her husband advocated for preserving American Indian cultural sovereignty and for full US citizenship rights. Zitkala-Sa wrote two books in the early 1900s, Old Indian Legends and American Indian Stories. The objective of this exploratory analysis is to determine whether the two texts have different themes. The data comes from Project Gutenberg, where her texts are indexed as 338 and 10376, and is downloaded into R via the gutenberg_download function.

Summary of Key Findings

The following plots depict the words most characteristic of each of Zitkala-Sa's books and a topic model built from them. Each line of text is labeled with its book, the labels are stored as a factor, and stop words are filtered out. Stop words include articles, pronouns, and other very common English words. Removing them lets us focus on the words Zitkala-Sa chose, which indicate the themes of her texts. The name Iktomi was also removed since it was used almost four times as often as other words within Old Indian Legends.

library(gutenbergr)
library(tidyverse)
library(tidytext)
library(stm)
library(quanteda)

txt<-gutenberg_download(c(338, 10376))

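# Label each line with the most recent book title seen above it.
txt_books <- txt %>%
  # One regex with alternation avoids recycling a pattern vector across rows.
  mutate(book=str_extract(text,"OLD INDIAN LEGENDS|AMERICAN INDIAN STORIES")) %>%
  fill(book) %>%
  # Drop lines that precede the first title (book is still NA there).
  filter(!is.na(book)) %>%
  mutate(book=factor(book,levels = unique(book)))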
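
The labeling step tags every line with the most recent title that appears above it. As a minimal sketch of that idea, the base-R example below uses made-up lines and a running count of title matches (cumsum) in place of tidyr's fill; it is an alternative illustration, not the code used in this report.

```r
# Toy lines standing in for the downloaded text (made up for illustration).
lines <- c("preface", "OLD INDIAN LEGENDS", "a tale", "another tale",
           "AMERICAN INDIAN STORIES", "a memoir")
titles <- c("OLD INDIAN LEGENDS", "AMERICAN INDIAN STORIES")

# TRUE where a line is exactly one of the titles.
is_title <- lines %in% titles

# Running count of titles seen so far; 0 means "before the first title".
section <- cumsum(is_title)

# Look up the title for each section; lines before any title get NA.
book <- ifelse(section == 0, NA, lines[is_title][pmax(section, 1)])

# Keep only lines that belong to a book.
labelled <- data.frame(text = lines, book = book, stringsAsFactors = FALSE)
labelled <- labelled[!is.na(labelled$book), ]
print(labelled)
```

The "preface" line is dropped because no title precedes it, which mirrors the filter on missing book labels.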

# One row per word, with stop words and the name "iktomi" removed.
tidy_txt <- txt_books %>%
  mutate(line=row_number()) %>%
  unnest_tokens(word,text) %>%
  anti_join(stop_words,by="word") %>%
  filter(word!="iktomi")
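
To make the stop-word step concrete, here is a small base-R sketch on a made-up sentence. The vector toy_stop_words is a hypothetical five-word list used only for illustration; the report itself relies on tidytext's stop_words table.

```r
# Made-up sentence and a tiny hypothetical stop list (illustration only).
sentence <- "The badger and the bear sat by the tepee"
toy_stop_words <- c("the", "and", "by", "a", "of")

# Lowercase, then split on anything that is not a letter or apostrophe.
tokens <- strsplit(tolower(sentence), "[^a-z']+")[[1]]
tokens <- tokens[tokens != ""]

# Keep only tokens that are not in the stop list.
kept <- tokens[!tokens %in% toy_stop_words]
print(kept)
```

Only the content words remain after filtering, which is the effect the anti_join has on the full token table.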

Exploring Word Frequency within the Books

Below is a facet plot depicting the top ten words in Old Indian Legends and in American Indian Stories. The code ranks words by tf-idf, a frequency measure that gives more weight to words distinctive to one book, and each bar shows a word's score within its book. The two panels share only one word, tepee; the other words do not overlap. Based on the words shown, Old Indian Legends appears to center on nature themes, while American Indian Stories centers on the lives of Native Americans.

tidy_txt %>%
  count(book,word,sort=TRUE) %>%
  bind_tf_idf(word,book,n) %>%
  group_by(book) %>%
  # Rank explicitly by tf_idf instead of relying on the last column.
  top_n(10,wt=tf_idf) %>%
  ungroup() %>%
  mutate(word=reorder(word,tf_idf)) %>%
  ggplot(aes(word,tf_idf,fill=book))+
  geom_col(show.legend = FALSE)+
  facet_wrap(~book,scales="free")+
  coord_flip()+
  labs(title="Top Words in Each Book by tf-idf",
       y="tf-idf",
       x="Words")+
  scale_fill_manual(values=c("#E69F00", "#56B4E9"))
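
The scores plotted above come from bind_tf_idf, which multiplies a word's term frequency (its share of a book's counted words) by its inverse document frequency (the natural log of the number of books divided by the number of books containing the word). The base-R sketch below reproduces that arithmetic on made-up counts for two toy books; the counts are invented for illustration and are not results from this analysis.

```r
# Made-up word counts for two toy "books" (illustration only).
counts <- data.frame(
  book = c("A", "A", "A", "B", "B", "B"),
  word = c("tepee", "badger", "lake", "tepee", "mother", "school"),
  n    = c(4, 3, 3, 2, 5, 3),
  stringsAsFactors = FALSE
)

# Term frequency: share of a book's counted words taken by each word.
counts$tf <- counts$n / ave(counts$n, counts$book, FUN = sum)

# Inverse document frequency: ln(number of books / books containing the word).
n_books <- length(unique(counts$book))
books_with_word <- ave(counts$n, counts$word, FUN = length)
counts$idf <- log(n_books / books_with_word)

# tf-idf is the product of the two.
counts$tf_idf <- counts$tf * counts$idf
print(counts)
```

In this toy table the word that occurs in both books gets an idf of ln(2/2) = 0, so its tf-idf is 0 regardless of how often it is used. With only two documents, tf-idf therefore favors words that are unique to one book.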

Topic Model of the Books

Below is a topic model analysis of the two books written by Zitkala-Sa. A topic model is an unsupervised model that groups words into topics and estimates how strongly each word contributes to each topic. Two topics were chosen since there are two books. The first plot shows the words most prevalent in each topic. In the fitted model, Topic 1 consists mostly of words dealing with nature and Topic 2 consists of words dealing with family.

topic_df <- tidy_txt %>%
  count(book,word,sort=TRUE) %>%
  cast_dfm(book,word,n)

topic_model <- stm(topic_df,K=2,init.type = "Spectral")
## Beginning Spectral Initialization 
##   Calculating the gram matrix...
##   Finding anchor words...
##      ..
##   Recovering initialization...
##      .....................................................
## Initialization complete.
## ..
## Completed E-Step (0 seconds). 
## Completed M-Step. 
## Completing Iteration 1 (approx. per word bound = -7.539) 
## ..
## Completed E-Step (0 seconds). 
## Completed M-Step. 
## Completing Iteration 2 (approx. per word bound = -7.521, relative change = 2.478e-03) 
## ..
## Completed E-Step (0 seconds). 
## Completed M-Step. 
## Model Converged
# Per-topic word probabilities (beta), one row per topic-term pair.
tidy_topic<-tidy(topic_model)
tidy_topic %>%
  group_by(topic) %>%
  # Rank explicitly by beta instead of relying on the last column.
  top_n(10,wt=beta) %>%
  ungroup() %>%
  mutate(term=reorder(term,beta)) %>%
  ggplot(aes(term,beta,fill=topic))+
  geom_col(show.legend = FALSE)+
  facet_wrap(~topic,scales="free")+
  coord_flip()+
  labs(title="Contribution of each Word to the Topics",
       y="Beta (per-topic word probability)",
       x="Term")+
  scale_fill_gradient(low="#E69F00", high="#56B4E9")

The final graphic is a facet plot depicting how much each book contributes to each topic, measured by gamma, the estimated proportion of a document that belongs to a topic. The topic model determined that each book contributes to one topic: American Indian Stories contributes to Topic 2 and Old Indian Legends contributes to Topic 1. The topics are determined by the word use in each of the texts.

tidy_gamma <- tidy(topic_model,matrix = "gamma",
                   document_names = rownames(topic_df))

ggplot(tidy_gamma,aes(gamma,fill=as.factor(document)))+
  geom_histogram()+
  facet_wrap(~topic)+
  labs(title="Contribution of each Book to the Topics",
       x="Topic Contribution",
       y="Count",
       fill="Book")+
  scale_fill_manual(values=c("#56B4E9","#E69F00"))
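
With only two documents, the same information can also be read as a table. As a minimal sketch, the base-R example below takes a gamma table in the long layout returned by tidy with matrix = "gamma" (columns document, topic, gamma) and picks the dominant topic for each document. The gamma values and document names are made up for illustration and are not the fitted values from this model.

```r
# Made-up gamma values in the long layout (document, topic, gamma).
toy_gamma <- data.frame(
  document = c("Book A", "Book A", "Book B", "Book B"),
  topic    = c(1, 2, 1, 2),
  gamma    = c(0.9, 0.1, 0.2, 0.8),
  stringsAsFactors = FALSE
)

# Gamma values for one document sum to 1 across topics.
totals <- tapply(toy_gamma$gamma, toy_gamma$document, sum)

# For each document, keep the row with the largest gamma.
by_doc <- split(toy_gamma, toy_gamma$document)
dominant <- do.call(rbind, lapply(by_doc, function(d) d[which.max(d$gamma), ]))
print(dominant)
```

Applying the same steps to tidy_gamma would give one row per book, naming the topic that book contributes to most.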

Conclusions

The exploratory analysis of the texts written by Zitkala-Sa showed differences between Old Indian Legends and American Indian Stories. Word frequencies in each book showed that Old Indian Legends primarily uses words associated with nature, while American Indian Stories uses words associated with family and day-to-day life in America. The unsupervised topic model identified two distinct topics: Old Indian Legends contributes to Topic 1 and American Indian Stories contributes to Topic 2. The words in each topic are consistent with the initial word frequency calculations, with Topic 1 consisting of words associated with nature and Topic 2 of words associated with family and humans.

Recommendations

Zitkala-Sa's texts could share themes with those of other Native American writers of her time. We recommend comparing her books with works by her contemporaries and with more recent works by Native American authors, so that themes from the 1900s can be compared with those from the 2000s. This follow-on analysis would show whether themes have changed over time or remained the same.

Technical Notes

sessionInfo()
## R version 4.0.3 (2020-10-10)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 18.04.5 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/openblas/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/libopenblasp-r0.2.20.so
## 
## locale:
##  [1] LC_CTYPE=C.UTF-8       LC_NUMERIC=C           LC_TIME=C.UTF-8       
##  [4] LC_COLLATE=C.UTF-8     LC_MONETARY=C.UTF-8    LC_MESSAGES=C.UTF-8   
##  [7] LC_PAPER=C.UTF-8       LC_NAME=C              LC_ADDRESS=C          
## [10] LC_TELEPHONE=C         LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C   
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] quanteda_3.1.0   stm_1.3.6        tidytext_0.3.0   forcats_0.5.1   
##  [5] stringr_1.4.0    dplyr_1.0.4      purrr_0.3.4      readr_1.4.0     
##  [9] tidyr_1.1.2      tibble_3.0.6     ggplot2_3.3.3    tidyverse_1.3.0 
## [13] gutenbergr_0.2.1
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_1.0.6         lubridate_1.7.9.2  lattice_0.20-41    assertthat_0.2.1  
##  [5] digest_0.6.27      plyr_1.8.6         R6_2.5.0           cellranger_1.1.0  
##  [9] backports_1.2.1    reprex_1.0.0       evaluate_0.14      highr_0.8         
## [13] httr_1.4.2         pillar_1.4.7       rlang_0.4.10       curl_4.3          
## [17] readxl_1.3.1       rstudioapi_0.13    data.table_1.13.6  Matrix_1.2-18     
## [21] rmarkdown_2.6      labeling_0.4.2     urltools_1.7.3     triebeard_0.3.0   
## [25] munsell_0.5.0      broom_0.7.4        compiler_4.0.3     janeaustenr_0.1.5 
## [29] modelr_0.1.8       xfun_0.20          pkgconfig_2.0.3    htmltools_0.5.1.1 
## [33] tidyselect_1.1.0   crayon_1.4.0       dbplyr_2.1.0       withr_2.4.1       
## [37] SnowballC_0.7.0    grid_4.0.3         jsonlite_1.7.2     gtable_0.3.0      
## [41] lifecycle_0.2.0    DBI_1.1.1          magrittr_2.0.1     scales_1.1.1      
## [45] tokenizers_0.2.1   RcppParallel_5.1.4 cli_2.3.0          stringi_1.5.3     
## [49] reshape2_1.4.4     farver_2.0.3       fs_1.5.0           xml2_1.3.2        
## [53] ellipsis_0.3.1     stopwords_2.2      generics_0.1.0     vctrs_0.3.6       
## [57] fastmatch_1.1-3    tools_4.0.3        glue_1.4.2         hms_1.0.0         
## [61] yaml_2.2.1         colorspace_2.0-0   rvest_0.3.6        knitr_1.31        
## [65] haven_2.3.1