Introduction

It’s time to brush up on my skills and go back to the best data set, squirrels and their stories.

Reading in the Data

First, let’s start with the easy part: reading in the data on our best friends, the squirrels. While there are many guides, I chose this1 one.
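
Before any of the code will run, the packages it relies on need to be loaded. For reference, this is the set of libraries whose functions appear throughout this post (I’m assuming they are already installed):

library(httr)              # GET(), content()
library(jsonlite)          # fromJSON()
library(dplyr)             # select(), filter(), mutate() and the pipe
library(tidyr)             # replace_na(), pivot_longer()
library(stringr)           # str_replace()
library(tibble)            # rownames_to_column()
library(ggplot2)           # all of the plots
library(grid)              # unit() used in the theme() calls
library(DT)                # datatable()
library(tm)                # corpus building and document-term matrices
library(wordcloud)         # wordcloud()
library(RColorBrewer)      # brewer.pal()
library(SentimentAnalysis) # analyzeSentiment()
library(syuzhet)           # get_nrc_sentiment()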

path <- "https://data.cityofnewyork.us/resource/gfqj-f768.json"

request <- GET(url = path)

response <- content(request, as = "text", encoding = "UTF-8")

squirrels <- fromJSON(response, flatten = TRUE) %>% 
  data.frame()

datatable(squirrels, class = 'table-bordered',
          caption = 'Full Squirrel Tales Table',
          width = '100%', options = list(scrollX = TRUE, pageLength = 2))

Something interesting I noticed is that when I read this in through the API, the column headers have underscores where the spaces were. Since they are all this way, I’ll leave them as they are.
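
As a quick sanity check (not strictly necessary), you can peek at the first few column names to see the underscore naming:

# A quick look at the column names as returned by the API
head(names(squirrels))

# If you did want spaces back, something like this would do it (left commented out)
# names(squirrels) <- gsub("_", " ", names(squirrels))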

Squirrels and Other Animals

An interesting pair of columns I noticed records whether another animal was with the squirrels and, if so, what it was. This made me curious to see what types of animals there were.

sq.others <- select(squirrels, shift,
                    story_topic_other_animals, story_topic_other)

# Keep rows with any other-animal information, label missing animal names
# as 'Unknown', and count how often each animal shows up
sq.others <- sq.others %>%
        filter(!if_all(c(story_topic_other_animals, story_topic_other), is.na)) %>% 
        replace_na(list(story_topic_other = 'Unknown')) %>%
        group_by(story_topic_other) %>%
        mutate(count = n())

datatable(sq.others, class = 'table-bordered',
          caption = 'Squirrels and Other Animals',
          width = '100%', options = list(scrollX = TRUE, pageLength = 2))

Since I was recommended to try circles sized by how often values show up, I’m going to attempt it with this data.

p1 <- ggplot(sq.others, aes(x = shift, y = story_topic_other, 
                            color = story_topic_other)) + 
    geom_point() +
    labs(title = "Squirrels with Other Animals by AM or PM",
         x = "AM or PM", y = "Other Animals",
         color = "Other Animals") +
    theme_minimal() +
    theme(legend.key.size = unit(.5, 'mm'), legend.key.width = unit(.5, "mm"), 
          legend.position = "none")

p1 + geom_point(aes(size = count))

As we can see, in both the AM and PM birds were the most-seen other animal, followed by many volunteers either not seeing what the other animal was or not writing it down. Interestingly enough, someone did see a person walking a cat on a leash and raccoons at night at the same time.

Squirrels don’t like weddings.

Based on one of the best comments, I’ll now go ahead and look at how often all of these story topics show up.

# Reshape the story-topic flags into long format: one row per recorded topic
sq.story <- squirrels %>%
  select(shift, story_topic_squirrel, story_topic_park_experience, 
         story_topic_accidental_poems, story_topic_other_animals, 
         story_topic_census_takers, story_topic_squirrels_acting, 
         story_topic_dogs) %>%
  pivot_longer(!shift,
               names_to = "stories",
               values_to = "True",
               values_drop_na = TRUE) %>%
  mutate(stories = str_replace(stories, "story_topic_", ""))

p2 <- ggplot(sq.story, aes(x=shift, fill=stories)) +
    geom_bar(position=position_dodge()) + 
  labs(title = "Squirrel Stories during AM or PM", 
       caption = "Squirrels may have more than one story at a time", 
       x = "AM or PM", y= "count") +
   theme_minimal() +
 theme(legend.key.size = unit(3, 'mm'),legend.key.width = unit(7,"mm"), 
       legend.position="bottom") + 
  scale_fill_brewer(palette = "Set3") 

p2

Here we can see volunteers spending a lot of time writing about the park experience and the squirrels themselves, no matter the time of day. Interestingly, though, while dogs were heavily written about in the AM, by PM they dropped below write-ups about other animals.

My favorite column is the accidental poems, which are described as: “This tag indicates a note from volunteer that read, to the Census team, like short poems about the park.”

Testing Text Analysis

Disclaimer

I fully expect this will not work well, so I’m just going to try my best.

Corpus

Next I’m going to try to make a corpus. A corpus is a collection of text documents, and it’s used throughout linguistics and text analysis. I will attempt this using the tm package, with a guide2 to test against. Below you will see the first document of the corpus built from the story notes.

corpus <- SimpleCorpus(VectorSource(squirrels$note_squirrel_park_stories))

strwrap(corpus[[1]])
## [1] "Observed a squirrel with a cache of peanuts that he was eating."  
## [2] "Strangely, none of the other squirrels were eating those peanuts."

If you are more interested in what’s in a corpus, I highly recommend reading the article! Now that we have a simple corpus, I’m going to follow the next steps.

# 1. Stripping any extra white space:
corpus <- tm_map(corpus, stripWhitespace)
## Warning in tm_map.SimpleCorpus(corpus, stripWhitespace): transformation drops
## documents
# 2. Transforming everything to lowercase
corpus <- tm_map(corpus, content_transformer(tolower))
## Warning in tm_map.SimpleCorpus(corpus, content_transformer(tolower)):
## transformation drops documents
# 3. Removing numbers 
corpus <- tm_map(corpus, removeNumbers)
## Warning in tm_map.SimpleCorpus(corpus, removeNumbers): transformation drops
## documents
# 4. Removing punctuation
corpus <- tm_map(corpus, removePunctuation)
## Warning in tm_map.SimpleCorpus(corpus, removePunctuation): transformation drops
## documents
# 5. Removing stop words
corpus <- tm_map(corpus, removeWords, stopwords("english"))
## Warning in tm_map.SimpleCorpus(corpus, removeWords, stopwords("english")):
## transformation drops documents

If you’re wondering what stop words are, they are common words that carry little value for most text analysis. You will see below some of the popular ones in English, followed by a small demo of removing them from a sentence.

stopwords("english")
##   [1] "i"          "me"         "my"         "myself"     "we"        
##   [6] "our"        "ours"       "ourselves"  "you"        "your"      
##  [11] "yours"      "yourself"   "yourselves" "he"         "him"       
##  [16] "his"        "himself"    "she"        "her"        "hers"      
##  [21] "herself"    "it"         "its"        "itself"     "they"      
##  [26] "them"       "their"      "theirs"     "themselves" "what"      
##  [31] "which"      "who"        "whom"       "this"       "that"      
##  [36] "these"      "those"      "am"         "is"         "are"       
##  [41] "was"        "were"       "be"         "been"       "being"     
##  [46] "have"       "has"        "had"        "having"     "do"        
##  [51] "does"       "did"        "doing"      "would"      "should"    
##  [56] "could"      "ought"      "i'm"        "you're"     "he's"      
##  [61] "she's"      "it's"       "we're"      "they're"    "i've"      
##  [66] "you've"     "we've"      "they've"    "i'd"        "you'd"     
##  [71] "he'd"       "she'd"      "we'd"       "they'd"     "i'll"      
##  [76] "you'll"     "he'll"      "she'll"     "we'll"      "they'll"   
##  [81] "isn't"      "aren't"     "wasn't"     "weren't"    "hasn't"    
##  [86] "haven't"    "hadn't"     "doesn't"    "don't"      "didn't"    
##  [91] "won't"      "wouldn't"   "shan't"     "shouldn't"  "can't"     
##  [96] "cannot"     "couldn't"   "mustn't"    "let's"      "that's"    
## [101] "who's"      "what's"     "here's"     "there's"    "when's"    
## [106] "where's"    "why's"      "how's"      "a"          "an"        
## [111] "the"        "and"        "but"        "if"         "or"        
## [116] "because"    "as"         "until"      "while"      "of"        
## [121] "at"         "by"         "for"        "with"       "about"     
## [126] "against"    "between"    "into"       "through"    "during"    
## [131] "before"     "after"      "above"      "below"      "to"        
## [136] "from"       "up"         "down"       "in"         "out"       
## [141] "on"         "off"        "over"       "under"      "again"     
## [146] "further"    "then"       "once"       "here"       "there"     
## [151] "when"       "where"      "why"        "how"        "all"       
## [156] "any"        "both"       "each"       "few"        "more"      
## [161] "most"       "other"      "some"       "such"       "no"        
## [166] "nor"        "not"        "only"       "own"        "same"      
## [171] "so"         "than"       "too"        "very"
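
If you want to see removal in action, tm’s removeWords() can be applied directly to a character string (a toy example with a made-up sentence):

# Stop words are replaced with empty strings, so only the content words remain
removeWords("the squirrel ran up the tree because it was startled",
            stopwords("english"))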

Now we can revisit the first description from above to make sure it all worked.

strwrap(corpus[[1]])
## [1] "observed squirrel cache peanuts eating strangely none squirrels eating"
## [2] "peanuts"

For the Word Cloud

Before I stem, I’m going to make another version for the word cloud.

corpus1 <- corpus
DTM1 <- DocumentTermMatrix(corpus1)

Stemming

Now I’m focusing on “stemming”, which the guide describes as “collapsing words to a common root, which helps in the comparison and analysis of vocabulary.” In plainer terms, the stemmer trims word endings so related forms like “eating” and “eats” reduce to the same root (“eat”) and get counted as one term.
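
To make that concrete, here is a quick sketch of what the stemmer does to a few individual words (stemDocument also works on a plain character vector, stemming each element):

# Related forms collapse to a shared root, e.g. "observed"/"observing" -> "observ"
stemDocument(c("observed", "observing", "squirrels", "peanuts", "eating"))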

corpus <- tm_map(corpus, stemDocument)
## Warning in tm_map.SimpleCorpus(corpus, stemDocument): transformation drops
## documents
strwrap(corpus[[1]])
## [1] "observ squirrel cach peanut eat strang none squirrel eat peanut"

I’m not sure that helped… but I’ll keep trying.

Creating a Document-Term Matrix (DTM)

Now we’ll use the corpus to create a matrix that gives us a better handle on the data. Each row will be a unique document and each column will be a unique term. Per the guide, this matrix is stored as a simple_triplet_matrix, a sparse format that is more efficient for storing data that is mostly zeros.
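
Before building it on the real notes, here is a toy example (two made-up one-line documents, not the squirrel data) just to show the shape: one row per document, one column per term, counts in the cells.

# A tiny made-up corpus to illustrate the structure of a DTM
toy <- SimpleCorpus(VectorSource(c("squirrel eats a peanut",
                                   "squirrel climbs a tree")))
as.matrix(DocumentTermMatrix(toy))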

DTM <- DocumentTermMatrix(corpus)
inspect(DTM)
## <<DocumentTermMatrix (documents: 809, terms: 2993)>>
## Non-/sparse entries: 15046/2406291
## Sparsity           : 99%
## Maximal term length: 21
## Weighting          : term frequency (tf)
## Sample             :
##      Terms
## Docs   — area bird dog lot one peopl saw squirrel tree
##   173 13    2    0   0   0   0     0   0        3    1
##   198  1    0    0   1   0   1     1   0        2    2
##   325 19    0    1   1   0   0     0   0        5    0
##   33   0    0    2   0   0   0     1   0        3    2
##   368  5    1    1   2   0   0     1   0        4    1
##   420 14    2    0   1   0   0     0   0        4    2
##   65   0    1    0   4   0   0     0   0       14    3
##   701  0    0    0   0   0   0     0   0        0    2
##   790  4    0    0   0   0   0     0   1        1    0
##   81   0    2    0   0   1   1     0   2       11    0

Here we see we have 809 documents with 2,993 terms. We also can see some of the most frequent terms.
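
If you’d rather see frequent terms as a plain list instead of the inspect() sample, tm’s findFreqTerms() pulls every term above a frequency threshold (the cutoff of 50 below is an arbitrary choice):

# Terms appearing at least 50 times across all documents
findFreqTerms(DTM, lowfreq = 50)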

Word Clouds

Here are word clouds that summarize the top 75 words.

# Sum term frequencies across all documents and keep the 75 most common
sums <- as.data.frame(colSums(as.matrix(DTM1)))
sums <- rownames_to_column(sums) 
colnames(sums) <- c("term", "count")
sums <- arrange(sums, desc(count))
top_terms <- sums[1:75, ]
wordcloud(words = top_terms$term, freq = top_terms$count, min.freq = 1,
  max.words = 100, random.order = FALSE,
  colors = brewer.pal(8, "Pastel1"))

To no one’s surprise, we can see that squirrel and squirrels are the most used words, followed by tree. What’s interesting, however, is that asked somehow made it into the top 75 words. This is the unstemmed word cloud; below you will see the stemmed version.

# Same steps, but on the stemmed document-term matrix
sums <- as.data.frame(colSums(as.matrix(DTM)))
sums <- rownames_to_column(sums) 
colnames(sums) <- c("term", "count")
sums <- arrange(sums, desc(count))
top_terms <- sums[1:75, ]
wordcloud(words = top_terms$term, freq = top_terms$count, min.freq = 1,
  max.words = 100, random.order = FALSE,
  colors = brewer.pal(8, "Pastel1"))

Here we see something a bit different: squirrel is still the top word, but squirrels has been folded into it by the stemming. Additionally, dog seems to appear more prominently.

Sentiment Analysis

Now we’re going to see how positive or negative these notes are. The SentimentAnalysis package uses the Harvard-IV dictionary (General Inquirer), which is a dictionary of words associated with positive (1,915 words) or negative (2,291 words) sentiment.
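
If you’re curious which words those counts come from, the package ships the word lists as a data object; here is a quick peek (assuming the object is named DictionaryGI, as in the SentimentAnalysis documentation):

# The General Inquirer word lists bundled with SentimentAnalysis
data(DictionaryGI, package = "SentimentAnalysis")
length(DictionaryGI$positive)
length(DictionaryGI$negative)
head(DictionaryGI$positive)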

sent <- analyzeSentiment(DTM, language = "english")
# We're going to select just the Harvard-IV dictionary results
sent <- sent[, 1:4]
# Organizing it as a data frame
sent <- as.data.frame(sent)
# Now let's take a look at what these sentiment values look like
head(sent)
##   WordCount SentimentGI NegativityGI PositivityGI
## 1        10  0.00000000   0.00000000   0.00000000
## 2        33 -0.03030303   0.03030303   0.00000000
## 3        22 -0.09090909   0.09090909   0.00000000
## 4        19  0.10526316   0.05263158   0.15789474
## 5        35 -0.11428571   0.25714286   0.14285714
## 6        24 -0.12500000   0.16666667   0.04166667

Although I had a feeling the scores wouldn’t be positive, since the words used most often with squirrels aren’t positive ones, it was interesting to see that the notes with higher word counts still had many more negative words than positive.

Let’s look at an overall summary.

summary(sent$SentimentGI)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -1.00000 -0.04651  0.00000  0.01490  0.08696  0.50000

Now, just for fun, we’re going to look at how the words map onto emotions as well as positive or negative sentiment, using the syuzhet package and the NRC emotion lexicon.
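
As a quick sanity check of what get_nrc_sentiment() returns, here it is on a single made-up sentence; the output has one count column per NRC emotion plus negative and positive columns:

# One made-up sentence, just to see the shape of the output
get_nrc_sentiment("The happy squirrel buried a peanut and chased away an angry dog")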

sent2 <- get_nrc_sentiment(squirrels$note_squirrel_park_stories)

# Let's look at the corpus as a whole again:
sent3 <- as.data.frame(colSums(sent2))
sent3 <- rownames_to_column(sent3) 
colnames(sent3) <- c("emotion", "count")

ggplot(sent3, aes(x = emotion, y = count, fill = emotion)) + 
  geom_bar(stat = "identity") + 
  theme_minimal() + 
  theme(legend.position="none", panel.grid.major = element_blank()) + 
  labs( x = "Emotion", y = "Total Count") + 
  ggtitle("Sentiment of Note takers on Squirrels") + 
  theme(plot.title = element_text(hjust=0.5))

What’s interesting here is that, according to this, taking notes on squirrels came with many positive emotions, along with surprise, trust, and joy. Seeing so many squirrels must have brought a lot of joy to people’s lives.