CS695 Week 8 Notebook: Zynga Analysis

Install the necessary packages. Comment these lines out after installation.

# install.packages('tm')
# install.packages('RColorBrewer')
# install.packages('wordcloud')
# install.packages('tidytext')
# install.packages('dplyr')
# install.packages("readr")
# install.packages("plyr")
# install.packages("stringr")
# install.packages("stringi")
# install.packages('plotly')

Include the packages.

library('tm')

## Loading required package: NLP

library('RColorBrewer')
library('wordcloud')
library('readr')
library('tidytext')
library('dplyr')

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library('plyr')

## -------------------------------------------------------------------------

## You have loaded plyr after dplyr - this is likely to cause problems.
## If you need functions from both plyr and dplyr, please load plyr first, then dplyr:
## library(plyr); library(dplyr)

## -------------------------------------------------------------------------

## 
## Attaching package: 'plyr'

## The following objects are masked from 'package:dplyr':
## 
##     arrange, count, desc, failwith, id, mutate, rename, summarise,
##     summarize

library('stringr')
library('stringi')
library('plotly')

## Loading required package: ggplot2

## 
## Attaching package: 'ggplot2'

## The following object is masked from 'package:NLP':
## 
##     annotate

## 
## Attaching package: 'plotly'

## The following object is masked from 'package:ggplot2':
## 
##     last_plot

## The following objects are masked from 'package:plyr':
## 
##     arrange, mutate, rename, summarise

## The following object is masked from 'package:stats':
## 
##     filter

## The following object is masked from 'package:graphics':
## 
##     layout

Process data

# read the file once and reuse it under both names used in this notebook
ZyngaData <- readRDS("Zynga.RDS")
Zynga <- ZyngaData
tweets <- ZyngaData$text

# Read dictionaries
money.words = scan('moneyWords.txt', what='character', comment.char=';')
fear.words = scan('fearWords.txt', what='character', comment.char=';')
pos.words = scan('positive-words.txt', what='character', comment.char=';')
neg.words = scan('negative-words.txt', what='character', comment.char=';')

# swap out all non-alphanumeric characters
# Note: what counts as a letter, a number, or a punctuation mark varies slightly with your locale, so you may need to experiment a little to get exactly what you want.
# str_replace_all(tweets, "[^[:alnum:]]", " ")
# iconv(tweets, from = 'UTF-8', to = 'ASCII//TRANSLIT')
# Encoding(tweets)  <- "UTF-8"

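# Function to clean tweets
clean.text = function(x)
{
  # remove "rt"; this case-sensitive pattern also strips "rt" inside words
  # (e.g. "Important" becomes "Impoant") and leaves an uppercase "RT" in place
  x = gsub("rt", "", x)
  # remove @mentions
  x = gsub("@\\w+", "", x)
  # remove punctuation
  x = gsub("[[:punct:]]", "", x)
  # remove numbers
  x = gsub("[[:digit:]]", "", x)
  # remove links; punctuation is already gone, so each URL is a single "http..." word
  x = gsub("http\\w+", "", x)
  # delete runs of two or more spaces or tabs; note that this joins the surrounding words
  x = gsub("[ |\t]{2,}", "", x)
  # remove a blank space at the beginning
  x = gsub("^ ", "", x)
  # remove a blank space at the end
  x = gsub(" $", "", x)
  # lower-casing is disabled, so the original case is preserved
  #x = tolower(x)
  return(x)
}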

# clean tweets
tweets = clean.text(tweets)
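
The cleaner above strips every lowercase "rt", including inside words, and deleting runs of whitespace outright fuses neighbouring words. A minimal sketch of a stricter variant follows. The name clean.text.strict and the sample tweet are illustrative assumptions, not part of the original analysis, and the recorded output in this notebook was not produced with it.

```r
# Hypothetical stricter cleaner (an illustrative sketch, not the notebook's own function)
clean.text.strict <- function(x) {
  # remove links first, while their punctuation is still intact
  x <- gsub("http\\S+", "", x)
  # remove the retweet marker only when it is a standalone token
  x <- gsub("\\bRT\\b", "", x, perl = TRUE)
  # remove @mentions
  x <- gsub("@\\w+", "", x)
  # remove punctuation and digits
  x <- gsub("[[:punct:]]", "", x)
  x <- gsub("[[:digit:]]", "", x)
  # collapse runs of spaces or tabs to a single space instead of deleting them
  x <- gsub("[ \t]+", " ", x)
  # trim leading and trailing spaces
  x <- gsub("^ +| +$", "", x)
  x
}

# made-up sample tweet for illustration
sample_tweet <- "RT @player1: Important server issues   fixed, see http://t.co/abc123"
cleaned <- clean.text.strict(sample_tweet)
cleaned
```

With this variant the word "Important" survives intact and the words on either side of a removed mention or link stay separated by a single space.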

Create word cloud of tweets

corpus = Corpus(VectorSource(tweets))

# create term-document matrix
tdm = TermDocumentMatrix(
  corpus,
  control = list(
    wordLengths=c(3,20),
    removePunctuation = TRUE,
    stopwords = c("the", "a", stopwords("english")),
    removeNumbers = TRUE, 
  # tolower may cause trouble on Windows because of UTF-8 encoding, so it is set to FALSE
    tolower = FALSE) )

# convert to a matrix; this may consume close to 1 GB of RAM
tdm = as.matrix(tdm)

# get word counts in decreasing order
word_freqs = sort(rowSums(tdm), decreasing=TRUE) 

#check top 50 most mentioned words
head(word_freqs, 50)

##          Zynga          needs          Poker           just           card 
##            890            591            545            500            469 
##           sent         raised        Deborah          Scott        looking 
##            469            401            311            264            257 
##                       Prized            can          Petra          adult 
##            220            219            191            187            186 
##         Jeneva          found            now          trees             s 
##            174            166            147            147            146 
##           Play            How            car            bit         Points 
##            145            143            141            140            139 
##        rewards          video    sponsorship        needing          shook 
##            139            139            139            138            138 
##       RTHeres         gotas          Betty          Fruit        Kathryn 
##            137            137            125            113            112 
##            The         Career FarmVilleOnWeb         Spring         County 
##            110            105            105            104            104 
##            get         Corner           Game            use            Hat 
##            100            100             97             92             92 
##          Black            You           King            win          Check 
##             91             88             88             87             86

# remove the top words that do not generate insights, such as "the", "a", "and", etc.
# uncomment the next line if the top 5 words are not useful; [-(1:5)] drops the 1st through 5th entries
# word_freqs = word_freqs[-(1:5)]
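
Dropping entries by position removes whatever happens to rank first, which here would include informative terms such as "Zynga" and "Poker". A small sketch of dropping by name instead is shown below; toy_freqs and drop_terms are made-up stand-ins for word_freqs and for whichever terms you judge uninformative.

```r
# Toy frequency vector standing in for word_freqs (values are made up)
toy_freqs <- c(10, 7, 4, 3, 2)
names(toy_freqs) <- c("Zynga", "needs", "", "just", "Poker")

# Hypothetical list of terms to drop, including the empty-string token
drop_terms <- c("", "just", "needs")

# keep only the entries whose names are not in the drop list
kept <- toy_freqs[!(names(toy_freqs) %in% drop_terms)]
kept
```

The same indexing expression could be applied to word_freqs before building the data frame for the word cloud.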

# create a data frame with words and their frequencies
dm = data.frame(word=names(word_freqs), freq=word_freqs)

# plot the 50 most frequent words as a colored word cloud; needs the RColorBrewer package

wordcloud(head(dm$word, 50), head(dm$freq, 50), random.order=FALSE, colors=brewer.pal(8, "Dark2"))

Now we want a histogram of the sentiment scores for this data

score.sentiment = function(sentences, pos.words, neg.words, .progress='none')
{
  
  # we got a vector of sentences. plyr will handle a list
  # or a vector as an "l" for us
  # we want a simple array of scores back, so we use
  # "l" + "a" + "ply" = "laply":
  scores = laply(sentences, function(sentence, pos.words, neg.words) {
    
    # clean up sentences with R's regex-driven global substitute, gsub():
    sentence = gsub('[[:punct:]]', '', sentence)
    sentence = gsub('[[:cntrl:]]', '', sentence)
    sentence = gsub('\\d+', '', sentence)
    # lower-casing is disabled here, so dictionary matching is case-sensitive:
    #sentence = tolower(sentence)
    
    # split into words. str_split is in the stringr package
    word.list = str_split(sentence, '\\s+')
    # sometimes a list() is one level of hierarchy too much
    words = unlist(word.list)
    
    # compare our words to the dictionaries of positive & negative terms
    pos.matches = match(words, pos.words)
    neg.matches = match(words, neg.words)
    
    # match() returns the position of the matched term or NA
    # we just want a TRUE/FALSE:
    pos.matches = !is.na(pos.matches)
    neg.matches = !is.na(neg.matches)
    
    # and conveniently enough, TRUE/FALSE will be treated as 1/0 by sum():
    score = sum(pos.matches) - sum(neg.matches)
    
    return(score)
  }, pos.words, neg.words, .progress=.progress )
  
  scores.df = data.frame(score=scores, text=sentences)
  return(scores.df)
}

sentiment.scores= score.sentiment(tweets, pos.words, neg.words, .progress='none')

score <- sentiment.scores$score
p <- plot_ly(x = ~score, type = "histogram")
p
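
To make the scoring rule concrete, here is a minimal sketch of the same "positive matches minus negative matches" calculation on single sentences. The miniature dictionaries and the helper name score_one are assumptions for illustration only; they are not the dictionary files read earlier.

```r
# Made-up miniature dictionaries (assumptions for illustration only)
toy_pos <- c("glad", "thanks", "fixed")
toy_neg <- c("issue", "annoying", "sorry")

# Split a sentence on whitespace and count dictionary hits
score_one <- function(sentence, pos, neg) {
  words <- unlist(strsplit(sentence, "\\s+"))
  sum(!is.na(match(words, pos))) - sum(!is.na(match(words, neg)))
}

# two positive hits (glad, fixed) minus one negative hit (issue)
s1 <- score_one("glad the issue is fixed", toy_pos, toy_neg)

# "Sorry" is capitalized, so only "annoying" and "issue" match
s2 <- score_one("Sorry for the annoying issue", toy_pos, toy_neg)

# lower-casing first lets "sorry" match as well
s3 <- score_one(tolower("Sorry for the annoying issue"), toy_pos, toy_neg)

c(s1, s2, s3)
```

The second and third calls differ only in case handling, which shows how leaving tolower() disabled can undercount matches when the dictionary entries are lowercase.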

Histogram by weekday

#save(Zynga, file = "Zynga.RDS")
#load(file = "Zynga.RDS")
Zynga$days <- weekdays(as.POSIXlt(Zynga$created))
dfrm <-data.frame(table(Zynga[,"days"]))

# order the weekday levels so the bars appear Monday through Sunday
# (weekdays() returns names in the current locale; these levels assume English)
dfrm$Var1 <- factor(dfrm$Var1, levels = c('Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday'))
p <- plot_ly(dfrm, x = ~Var1, y = ~Freq, type = 'bar', name = 'Tweets') %>%
  layout(yaxis = list(title = 'Count'))

p
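
The weekday counts can also be built without depending on the session locale. The sketch below uses made-up timestamps; the names created, day_levels, and counts are illustrative assumptions rather than objects from the Zynga data.

```r
# Made-up timestamps for illustration (a Monday, two on a Tuesday, and a Sunday)
created <- as.POSIXct(
  c("2018-03-05 10:00:00", "2018-03-06 12:30:00",
    "2018-03-06 18:45:00", "2018-03-11 09:15:00"),
  tz = "UTC"
)

day_levels <- c("Monday", "Tuesday", "Wednesday", "Thursday",
                "Friday", "Saturday", "Sunday")

# "%u" gives the weekday as a number (1 = Monday ... 7 = Sunday), independent of locale
day_index <- as.integer(format(created, "%u"))

# an ordered set of levels keeps the days in calendar order and keeps zero-count days
days <- factor(day_levels[day_index], levels = day_levels)
counts <- as.data.frame(table(days))
counts
```

Because the factor carries all seven levels, days with no tweets still appear with a count of zero, so a bar chart built from counts keeps a fixed Monday to Sunday axis.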

Results:

# Here I run several grep searches that may give further insight into certain terms for Zynga.
# Words like "needs" and "raised" come from posts that the Zynga apps themselves publish on behalf of users playing the games.
# Similarly, common first names appear in the data because the apps sometimes post updates that include players' first names.
# Searching for the term "Zynga" yields some insights. In the English-language tweets, users report glitches (quality problems) that they find annoying. Next I search for annoyance words (annoying, issue) to see if anything else comes up.

index = grep("issue", tweets)
tweets[index]

##  [1] "RT Impoant to remember these are longstanding issues from Zynga games to Targeted sharing"
##  [2] "Thanks for the suppo Rest assured that the team is doing their best to address the issues in the ga"
##  [3] "Impoant to remember these are longstanding issues from Zynga games to Targeted sharing"
##  [4] "This is an an issue and our team is currently working on a fix Please contact them to fuher check"
##  [5] "I as well have no issues giving you the names of my managers whom worked inhouse at the SF HQ I can as"
##  [6] "This is a known issue which I came to know when I went to Zyngas blog But unfounately there is"
##  [7] "If the issue still persists after following the instructions mentioned please contact Suppo "
##  [8] "This is an ongoing issue and our team is currently checking on this For fuher announcements pleas"
##  [9] "No just that it became an issue for indie devs and businesses at ceain volumes\n\nIm sure Zynga and F"
## [10] "We appreciate your feedback as well as your repo about the connection The server issue is currently"
## [11] "Were currently experiencing server issues Please be on the lookoutfor fuher"
## [12] "RTFVCE Howdy Farmers Were glad to inform you that the issues with loading or games reveing back to levelhave been"
## [13] "RTEDITFree Favors issue has been fixed Please take the survey and claim yourFree Favors"
## [14] "This issue is being looked intoPlease stay tuned for updates by visiting the Hot Top"
## [15] "FVCE Howdy Farmers Were glad to inform you that the issues with loading or games reveing back to levelhave"
## [16] "Its a known issue that the team is looking into You may check here for"
## [17] "We apologize for the inconvenience that this Live Race connection issue has brought Rest assured that th"
## [18] "Sorry for the inconvenience Rest assured that the team is constantly working on the games issues"
## [19] "Our apologies for the inconvenience Both issues you have mentioned are currently being narrowed down"
## [20] "The connection issue in Live Races is currently being worked on by the team This post"
## [21] "HiThis issue has been fixed with the recent update Make sure that your game has the l"
## [22] "This experience is still an ongoing issue that our team is currently working on Fuher details w"
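
Several annoyance words can be searched in one pass by combining them in a single pattern. The sketch below uses made-up tweets; toy_tweets and the pattern are illustrative assumptions, not results from the Zynga data.

```r
# Made-up tweets for illustration only
toy_tweets <- c(
  "This is a known issue",
  "So annoying when the game crashes",
  "Thanks for the rewards",
  "Server Issues again"
)

# "|" means "either term"; "annoy" also covers "annoying" and "annoyed"
pattern <- "issue|annoy"

# ignore.case = TRUE matches "Issues" as well as "issue"
hits <- grep(pattern, toy_tweets, ignore.case = TRUE)
toy_tweets[hits]
```

Because the cleaned tweets keep their original capitalization, ignore.case = TRUE avoids missing matches at the start of a sentence.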

# grep searches for "Zynga", names, and certain other words have been omitted because their output is very long; my findings are summarized in the comments above.

# It looks like if Zynga focused a bit more on quality control and maintenance, it could maintain a large user base. Many people already play its games and, judging by the word cloud results, enjoy sharing them with their friends. Along with a focus on quality and maintenance, improving and expanding the social aspects of its titles would likely be positively received and attract more users. Another idea would be to offer in-game benefits by day of the week if the company wanted to improve user play statistics across the week.

Rowen Darrell