Text Mining Kickstarter Projects

if(!("devtools" %in% installed.packages())) install.packages("devtools")
if("srhoads" %in% installed.packages()) library(srhoads) else devtools::install_github("srhoads/srhoads")
pkg("shiny")
pkg("tidyverse")

Overview

Kickstarter is an American public-benefit corporation based in Brooklyn, New York, that maintains a global crowd funding platform focused on creativity. The company’s stated mission is to “help bring creative projects to life”.

Kickstarter has reportedly received more than $1.9 billion in pledges from 9.4 million backers to fund 257,000 creative projects, such as films, music, stage shows, comics, journalism, video games, technology and food-related projects.

For this assignment, I am asking you to analyze the descriptions of Kickstarter projects to identify commonalities of successful (and unsuccessful) projects using the text mining techniques we covered in the past two lectures.


Data

The dataset for this assignment is taken from webroboto.io's repository. They developed a scraper robot that has crawled all Kickstarter projects monthly since 2009. We will just take data from the most recent crawl, on 2018-02-15.

To simplify your task, I have downloaded the files and partially cleaned the scraped data. In particular, I converted several JSON columns, corrected some obvious data issues, and removed some variables that are not of interest (or missing frequently). I have also subsetted the data to only contain projects originating in the United States (to have only English language and USD denominated projects).

The data is contained in the file kickstarter_projects.csv and contains about 150,000 projects and about 20 variables.

Tasks for the Assignment

1. Identifying Successful Projects

Below, I’m reading the data straight from github.

# d <- read.csv("https://raw.githubusercontent.com/QMSS-GR5063-2018/DV_CU_course_material/master/Exercises/09_kickstarter/kickstarter_projects.csv?token=AZbwLuzRYxYEP2HM2U0W6xLJR6ZeCw8-ks5ayrITwA%3D%3D")

d <- tryCatch(read.csv("https://raw.githubusercontent.com/QMSS-GR5063-2018/DV_CU_course_material/master/Exercises/09_kickstarter/kickstarter_projects.csv?token=AZbwLuzRYxYEP2HM2U0W6xLJR6ZeCw8-ks5ayrITwA%3D%3D"), error=function(e){
  tryCatch(read.csv("https://raw.githubusercontent.com/QMSS-GR5063-2018/DV_CU_course_material/master/Exercises/09_kickstarter/kickstarter_projects.csv?token=AGLPALRW2PWE4X2PF4ZZSGC5AR3WK"), 
           error=function(e){
             tryCatch(get(load("kickstarter_projects.csv.rda")), error=function(e) NULL)
           })
})

Interestingly enough, it seems that .rda (RData) files are smaller than .f (feather) files.
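If you want to verify that on disk, base R's file.size() will do it. A minimal sketch, assuming both serialized copies exist in the working directory:

# Hedged sketch: compare on-disk sizes of the two serialized copies, if both are present
file.size("kickstarter_projects.csv.rda")
file.size("kickstarter_projects.csv.f")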

# kickstarter_projects <- d
loadfromrda = T # loadfromrda = F
subsetsmall = T
if(loadfromrda) d <- get(load("kickstarter_projects.csv.rda"))
if(subsetsmall) d <- d %>% sample_n(., 5000)
save_kickstarter_projects = F
if(save_kickstarter_projects) save(kickstarter_projects, file="kickstarter_projects.csv.rda") # feather::write_feather(kickstarter_projects, "kickstarter_projects.csv.f")
if(!exists("d")|tryCatch(is.null(d), error=function(e) T)) d <- data.frame(
  backers_count = c(51L, 26L, 89L, 41L, 11L),
  converted_pledged_amount = c(1536L, 1616L, 3700L, 1510L, 1054L),
  goal = c(1000, 1500, 500, 1200, 2957),
  id = c(747214266L, 167852290L, 954681482L, 219527796L,
         1832299147L),
  pledged = c(1536.01, 1616, 3700, 1510, 1054),
  blurb = as.factor(c("Fancy Ketchup seeks the help of its loyal fan base to raise enough money to fund its followup to its first album,
                                          \"Hold the Mayo.\"",
                      "THE PLATH PROJECT: TWO WORLD PREMIERES\n@ The Center for New Music, San Francisco",
                      "Quality handmade pens made from exotic hardwoods and other elegant materials",
                      "\"The Dracula Letters\" is the latest project by composer S.J. Pettersson featuring famed mezzo soprano Iris Malkin.",
                      "Powerfully Healing Perspective... Learn how flipping the current pain scale empowers your body/mind to rewire faulty brain programming!")),
  country = as.factor(c("USA", "USA", "USA", "USA", "USA")),
  created_at = as.factor(c("2013-02-06", "2014-10-09",
                           "2012-09-29", "2014-10-02",
                           "2018-01-21")),
  currency = as.factor(c("USD", "USD", "USD", "USD", "USD")),
  deadline = as.factor(c("2013-03-15", "2014-12-24",
                         "2012-10-30", "2014-11-05",
                         "2018-02-23")),
  is_starrable = as.factor(c("false", "false", "false", "false",
                             "true")),
  launched_at = as.factor(c("2013-02-13", "2014-10-25",
                            "2012-10-10", "2014-10-06",
                            "2018-01-24")),
  name = as.factor(c("Fancy Ketchup's Second Album",
                     "The Plath Project",
                     "Handcrafted Pens Made from Exotic Woods",
                     "\"The Dracula Letters\" -  by S.J. Pettersson",
                     "Comfort Quest- HEALING Pain through a Transformative Lens")),
  slug = as.factor(c("fancy-ketchups-second-album",
                     "the-plath-project",
                     "handcrafted-pens-made-from-exotic-woods",
                     "the-dracula-letters-by-sj-pettersson",
                     "comfort-quest-healing-pain-through-a-transformativ")),
  source_url = as.factor(c("https://www.kickstarter.com/discover/categories/music/rock",
                           "https://www.kickstarter.com/discover/categories/music/classical%20music",
                           "https://www.kickstarter.com/discover/categories/crafts",
                           "https://www.kickstarter.com/discover/categories/music/classical%20music",
                           "https://www.kickstarter.com/discover/categories/publishing/nonfiction")),
  spotlight = as.factor(c("true", "true", "true", "true",
                          "false")),
  staff_pick = as.factor(c("false", "true", "false", "true",
                           "false")),
  state = as.factor(c("successful", "successful",
                      "successful", "successful", "live")),
  state_changed_at = as.factor(c("2013-03-15", "2014-12-24",
                                 "2012-10-30", "2014-11-05",
                                 "2018-01-24")),
  location_town = as.factor(c("Los Angeles", "San Francisco",
                              "Tremonton", "Los Angeles",
                              "Jacksonville")),
  location_state = as.factor(c("CA", "CA", "UT", "CA", "FL")),
  top_category = as.factor(c("music", "music", "crafts", "music",
                             "publishing")),
  sub_category = as.factor(c("rock", "classical music", NA,
                             "classical music", "nonfiction"))
)

2. Writing your success story

Each project contains a blurb – a short description of the project. While not the full description of the project, the short headline is arguably important for inducing interest in the project (and ultimately popularity and success). Let’s analyze the text.

a) Cleaning the Text and Word Cloud

To reduce the time for analysis, select the 1000 most successful projects and a sample of 1000 unsuccessful projects. Use the cleaning functions introduced in lecture (or write your own in addition) to remove unnecessary words (stop words), syntax, punctuation, numbers, white space etc. Note that many projects use their own unique brand names in upper case, so try to remove these fully capitalized words as well (since we are aiming to identify common words across descriptions). Stem the words left and complete the stems. Create a document-term-matrix.
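As a minimal sketch of the requested selection (not what I end up doing below, since I work with the 5,000-project sample), assuming "most successful" means ranked by backers_count among projects with state == "successful":

# Hedged sketch: top 1,000 successful projects by backers, plus 1,000 random failed ones
top_successful <- d %>%
  dplyr::filter(state == "successful") %>%
  dplyr::arrange(dplyr::desc(backers_count)) %>%
  head(1000)
unsuccessful_pool <- d %>% dplyr::filter(state == "failed")
unsuccessful_sample <- unsuccessful_pool %>% dplyr::sample_n(min(1000, nrow(unsuccessful_pool)))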

library(tidyverse)
library(tidytext)
library(stringr)
library(tm)
library(topicmodels)
library(SnowballC)

First, we read in the documents in the corpus. Then, we do some text processing!

txt <- d

txt$Blurb1 <- iconv(txt$blurb, "latin1", "UTF-8", sub='')

blurb.1.corp <- Corpus(VectorSource(txt$Blurb1)) # build a corpus from the blurbs

(corpus <- blurb.1.corp)
<<SimpleCorpus>>
Metadata:  corpus specific: 1, document level (indexed): 0
Content:  documents: 5000

Below, I’m removing white-space, removing punctuation, changing the text to lowercase text, removing numbers, removing stop-words like “and” and “but”, and stemming the documents so that words can be interpreted regardless of tense or part of speech. This is important for using text mining dictionaries efficiently and gathering meaning efficiently. It’d be ridiculous to have to match every tense and every form and part of speech of every word. Stemming them gets to the root of what we want.

corpus <- tm_map(corpus, stripWhitespace) # removing whitespace
corpus <- tm_map(corpus, removePunctuation) # removing punctuation
corpus <- tm_map(corpus, content_transformer(tolower)) # making lowercase for consistency
corpus <- tm_map(corpus, removeNumbers) # removing numbers
corpus <- tm_map(corpus, removeWords, stopwords("english")) # removing English stopwords like transition words and articles
cleaned_corpus <- tm_map(corpus, stemDocument) # truncating words for broader meanings
# save(corpus, file = "cleaned_corpus.rda")
# load("cleaned_corpus.rda")

Since each project ends up either successful or unsuccessful, the blurbs should show different themes depending on the outcome. Lettuce see!

# document-term matrix over all blurbs
dtm <- tm::DocumentTermMatrix(corpus)

Next, I generate relative word frequencies for the bag of words and compare them to each other.

Below I'm just playing with the functions I took the time to learn about. It's pretty cool what you can extract quantitatively.
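As a quick sketch of what those frequencies look like, term counts can be tallied straight off the DTM built above (the dense conversion is fine at this scale, though it can get memory-hungry on bigger corpora):

# Hedged sketch: raw and relative term frequencies from the DTM above
term_freq <- sort(colSums(as.matrix(dtm)), decreasing = TRUE)
head(term_freq, 10)                   # most common word stems
head(term_freq / sum(term_freq), 10)  # as a share of all retained tokens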

# scanning progressively lower frequency thresholds
findFreqTerms(dtm, lowfreq = 10000)  # finding words w/ frequency: 10000
character(0)
findFreqTerms(dtm, lowfreq = 9000) # finding words w/ frequency: 9000
character(0)
findFreqTerms(dtm, lowfreq = 8000)  # finding words w/ frequency: 8000
character(0)
findFreqTerms(dtm, lowfreq = 7000)  # finding words w/ frequency: 7000
character(0)
findFreqTerms(dtm, lowfreq = 6000)  # finding words w/ frequency: 6000
character(0)
findFreqTerms(dtm, lowfreq = 5000)  # finding words w/ frequency: 5000
character(0)
findFreqTerms(dtm, lowfreq = 4000)  # finding words w/ frequency: 4000
character(0)
findFreqTerms(dtm, lowfreq = 3000)  # finding words w/ frequency: 3000
character(0)
findFreqTerms(dtm, lowfreq = 2000)  # finding words w/ frequency: 2000
character(0)

According to the frequency calls above, no word appears even 2,000 times across these blurbs, let alone 10,000: every call returns character(0). No single term dominates the corpus, which makes sense given how many different project categories (music, crafts, publishing, games, and so on) are mixed together here.

The more interesting comparison is between successful and unsuccessful projects: do the winning blurbs sound more hopeful? I can't see any strong shared theme from raw frequencies alone, so I'll come back to this once I split the blurbs by outcome below.

# word associations
findAssocs(dtm, "world", corlimit = 0.08)  # finding words associated with "world"
$world
      cricket       whizzys        honeys       overrun        purest 
         0.13          0.13          0.13          0.13          0.13 
       belief    population decentralized      partying      tomorrow 
         0.12          0.11          0.11          0.11          0.11 
       indian        around      premiere     countries         grows 
         0.10          0.09          0.09          0.09          0.09 
   impossible        target         swore     predators        aboard 
         0.09          0.09          0.09          0.09          0.09 
     sponsors        groups       tension          odds      restored 
         0.09          0.09          0.09          0.09          0.09 
   playwright 
         0.09 
# findAssocs(dtm, "music", corlimit = 0.1)  # finding words associated with stress!
# findAssocs(dtm, "help", corlimit = 0.1)  # finding words associated with studi!

Some of the words associated with "world" are shown above. Terms like "belief", "population", "tomorrow", and "countries" correlate with it, which fits blurbs pitching projects with a global or aspirational angle. None of the correlations are strong (all at or below 0.13), which again reflects how varied these blurbs are.

I'm plotting my DocumentTermMatrix below. It's reminiscent of a social network: connected words are shown as nodes joined by ties.

set.seed(1110)

if(!('Rgraphviz' %in% installed.packages())) {source("http://bioconductor.org/biocLite.R"); biocLite("Rgraphviz")}

plot(dtm, terms = findFreqTerms(dtm, lowfreq = 100),
     corThreshold = 0.10, cex = 15) # only plotting words with frequencies and have a correlation coeff above .10


Let's visualize a word cloud of the most frequent or important words (your choice of frequency measure) among the most successful projects.
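The clouds below use raw term frequency. If you'd rather weight by importance, tm can build a tf-idf weighted matrix instead; a minimal sketch (not what's used below):

# Hedged sketch: tf-idf weighting as an alternative "importance" measure
dtm_tfidf <- DocumentTermMatrix(corpus, control = list(weighting = weightTfIdf))
tfidf_scores <- sort(colSums(as.matrix(dtm_tfidf)), decreasing = TRUE)
# wordcloud(names(tfidf_scores), tfidf_scores, max.words = 100)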

library(wordcloud)

dim(top50usa <- head(usa.success <- d %>% dplyr::select(everything()) %>%
                       subset(country == "USA") %>%
                       subset(state == "successful") %>%
                       subset(backers_count > mean(backers_count, na.rm = T) + sd(backers_count, na.rm = T)*2) %>%
                       arrange(desc(backers_count)), 50))
[1] 50 23
dim(usa.success)
[1] 53 23
usa.success$type <- "successful"

dim(low50usa <- head(usa.failed <- d %>% select(everything()) %>%
                       subset(country == "USA") %>%
                       subset(state == "failed") %>%
                       subset(backers_count > mean(backers_count, na.rm = T) + sd(backers_count, na.rm = T)*2) %>%
                       arrange((backers_count))))
[1]  6 23
usa.failed$type <- "unsuccessful"
dim(usa.failed)
[1] 35 24
dat.suc.fail <- rbind(usa.success, usa.failed)

# head(dat.suc.fail$document <- dat.suc.fail$type) %>% DT::datatable(zapsmall())

dat.suc.fail$document <- dat.suc.fail$type

head(dat.suc.fail, 50) %>% #zapsmall() %>% 
  DT::datatable()
dat.suc.fail$ID <- seq.int(nrow(dat.suc.fail))

txt <- dat.suc.fail

txt$blurb <- iconv(txt$blurb, "latin1", "UTF-8", sub='')

corpus <- Corpus(VectorSource(txt$blurb))

corpus <- tm_map(corpus, stripWhitespace)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removeWords, stopwords("english"))

for (i in seq(corpus)) {
  corpus[[i]] <- gsub('[^a-zA-Z|[:blank:]]', "", corpus[[i]])
} # removing non-English characters

clean <- corpus

stemmed <- tm_map(clean, stemDocument, language="english")

stemCompletion2 <- function(x, dictionary) {
  x <- unlist(strsplit(as.character(x), " "))
  x <- x[x != ""]
  x <- stemCompletion(x, dictionary=dictionary)
  x <- paste(x, sep="", collapse=" ")
  PlainTextDocument(stripWhitespace(x))
}

complete <- parallel::mclapply(stemmed, stemCompletion2, dictionary=clean)

# attach metadata to corpus
for (i in 1:dim(dat.suc.fail)[1]){
  complete[[i]]$meta$id <- dat.suc.fail[i,"type"]
}

tb1000_complete  <- as.VCorpus(complete)
dtm <- DocumentTermMatrix(tb1000_complete)
dtm.mat <- as.matrix(dtm)
head(dtm.mat[,1:10]) # check if data is well processed
            Terms
Docs         able accessible acting action active advanced adventure
  successful    0          0      0      0      0        0         0
  successful    0          0      0      0      0        0         0
  successful    0          0      0      0      0        0         0
  successful    0          0      0      0      0        0         0
  successful    0          0      0      0      0        0         0
  successful    0          0      0      0      0        0         1
            Terms
Docs         adventurequest adventurerpg affordable
  successful              0            0          0
  successful              0            0          0
  successful              0            0          0
  successful              0            0          0
  successful              0            0          0
  successful              0            0          0
# Turn into data frame
extremes.df <- as.data.frame(tidy(dtm))
head(extremes.df)
    document    term count
1 successful  battle     1
2 successful   build     1
3 successful crafted     1
4 successful    epic     1
5 successful    game     1
6 successful sandbox     1
extremes.df %>% filter(document=="successful") %>% subset(., select=c("term","count")) %>%
  group_by(term) %>% summarise(total_count = sum(count)) %>%
  arrange(-total_count) -> successful

extremes.df %>% filter(document=="unsuccessful") %>% subset(., select=c("term","count")) %>%
  group_by(term) %>% summarise(total_count = sum(count)) %>%
  arrange(-total_count) -> unsuccessful

set.seed(13020) # so the word clouds are reproducible

# Create a wordcloud
pal1 = brewer.pal(9, "Greens")
pal2 = brewer.pal(9, "Reds")
wordcloud(successful$term, successful$total_count, max.words=100, scale=c(3.5, 0.3), colors=pal1)

wordcloud(unsuccessful$term, unsuccessful$total_count, max.words=100, scale=c(3.5, 0.3), colors=pal2)

library(plotrix)
successful$type <- "successful"
unsuccessful$type <- "unsuccessful"
rbind(successful, unsuccessful) %>% spread(., key=type, value=total_count) %>%
  filter(!is.na(.$successful) & !is.na(.$unsuccessful)) %>% as.data.frame(.) %>%
  mutate(difference = abs(.$successful - .$unsuccessful)) %>% arrange(-difference) -> commonwords

head(commonwords)
     term successful unsuccessful difference
1     new         15            1         14
2    game         12            1         11
3    card          1            5          4
4     let          5            1          4
5 bicycle          1            4          3
6   build          4            1          3
top20 <- commonwords[1:20,]
scol <- brewer.pal(n=9, name="Greens")
uscol <- brewer.pal(n=9, name="Reds")

(pyramid.plot <- plotrix::pyramid.plot(top20$successful, 
                                      top20$unsuccessful, labels=top20$term,
                                      # top.labels=c("Successful Projects", " ", "Unsuccessful Projects"), gap=25,
                                      main="Kickstarters: Success v. Failure Common Words",
                                      laxlab=NULL, raxlab=NULL, unit=NULL, lxcol=scol, rxcol=uscol
                                      ))

[1] 5.1 4.1 4.1 2.1
require(quanteda)
blurb.f.quant <- textstat_readability(txt$blurb, measure=c('Flesch.Kincaid'))
txt$IDdocument <- paste0('text',txt$ID)
blurb.f.quant$state <- txt$state[match(blurb.f.quant$document,txt$IDdocument)]
blurb.f.quant$state <- factor(blurb.f.quant$state, levels=c("successful","live","suspended","canceled","failed"))
blurb.f.quant$backers_count <- txt$backers_count[match(blurb.f.quant$document,txt$IDdocument)]

ggplot(blurb.f.quant, aes(x=state, y=Flesch.Kincaid, color=Flesch.Kincaid)) +
  geom_point(alpha=.5) + scale_color_gradient(low="green", high="red") +
  xlab("") + ylab("Readability Score (Flesch-Kincaid Grade-Level)") + theme_minimal() + theme(legend.position="none")

ggplot(blurb.f.quant, aes(x=backers_count, y=Flesch.Kincaid, color=Flesch.Kincaid)) +
  geom_jitter(alpha=.5) + scale_color_gradient(low="green", high="red") +
  stat_smooth(method="lm", se=TRUE, alpha=.2) +
  xlab("Backer County") + ylab("Readability (Flesch-Kincaid Grade Level)") + theme_minimal() + theme(legend.position="none")

a) Stay positive

Calculate the tone of each text based on the positive and negative words that are being used. You can rely on the Hu & Liu dictionary provided in lecture or use the Bing dictionary contained in the tidytext package (tidytext::sentiments). Visualize the relationship between tone of the document and success. Briefly comment.

My friend and I both have Macs, which for some reason wouldn't let qdap install and run for the longest time. We came up with a function instead. Eventually I got qdap to work by going through my installed packages and deleting one that was interfering. It was a whole ordeal.

I imagine that the more positive the text, the more likely the project didn't fail. Like I say at a few other points in this assignment, there doesn't seem to be a stark, noticeable difference in positivity at face value: both groups read as pretty optimistic.
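Before my own scoring function, here is a minimal sketch of the Bing-dictionary route the prompt mentions (not the approach I use below), assuming txt is still the combined success/failure set with ID, state, and blurb columns:

# Hedged sketch: per-blurb Bing tone (positive minus negative word counts) via tidytext
bing_tone <- txt %>%
  dplyr::select(ID, state, blurb) %>%
  tidytext::unnest_tokens(word, blurb) %>%
  dplyr::inner_join(tidytext::get_sentiments("bing"), by = "word") %>%
  dplyr::count(ID, state, sentiment) %>%
  tidyr::spread(sentiment, n, fill = 0) %>%
  dplyr::mutate(tone = positive - negative)
# ggplot(bing_tone, aes(state, tone)) + geom_jitter(alpha = .5) + theme_minimal()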

tidytext::get_sentiments("afinn") %>% arrange(score) %>% as.data.frame(.) -> sentiment # AFINN supplies the numeric valence scores used below

sentimentscore <- function(text){
  score <- 0
  for(x in 1:nrow(sentiment)){
    count <- length(grep(sentiment[x,1], text)) # search if the text contains a word in the 'sentiment' dictionary
    if(count){ # if the word in the dictionary appears in the text for whatever times,
      score <- score + (count * sentiment[x,2]) # sentiment score = number of times the word appeared * corresponding sentiment valence of the word
      text <- sub(sentiment[x,1],'',text) # replace that word with blank now, so that it won't be counted twice
    }
  }
  score # output is valence score
}

unlist(lapply(txt$blurb, sentimentscore)) %>% cbind(txt, .) -> sentiment.df
colnames(sentiment.df)[ncol(sentiment.df)] <- "sentimentscore" # the score appended by cbind() is the last column
sentiment.df$state <- factor(sentiment.df$state, levels=c("successful","live","suspended","canceled","failed"))

ggplot(sentiment.df, aes(x = state, y = sentimentscore, color = as.numeric(sentimentscore))) +
  geom_point(alpha=.5) + scale_color_gradient(low="red", high="green") +
  xlab("State") + ylab("Sentiment") + theme_minimal() + theme(legend.position="none")

ggplot(sentiment.df, aes(x = as.numeric(backers_count), y = sentimentscore, color = as.numeric(sentimentscore))) +
  geom_jitter(alpha=.5) + scale_color_gradient(low="red", high="green") +
  stat_smooth(method="lm", se=TRUE, alpha=.2) +
  xlab("Numbers of backers") + ylab("Sentiment") + theme_minimal() + theme(legend.position="none")

b) Positive vs negative

Segregate all 2,000 blurbs into positive and negative texts based on their polarity score calculated in step (a). Now, collapse the positive and negative texts into two larger documents. Create a document-term-matrix based on this collapsed set of two documents. Generate a comparison cloud showing the most-frequent positive and negative words.
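A minimal sketch of that comparison cloud, assuming the blurbs have already been split by polarity and collapsed into two strings, positive_text and negative_text (hypothetical objects here, e.g. built with paste(..., collapse = " ")); below I compare successful vs. failed blurbs instead:

# Hedged sketch: comparison cloud over a two-document (positive vs. negative) TDM
comp_corpus <- Corpus(VectorSource(c(positive_text, negative_text)))
comp_tdm <- TermDocumentMatrix(comp_corpus,
                               control = list(tolower = TRUE,
                                              removePunctuation = TRUE,
                                              stopwords = TRUE))
comp_m <- as.matrix(comp_tdm)
colnames(comp_m) <- c("positive", "negative")
wordcloud::comparison.cloud(comp_m, max.words = 100, colors = c("darkgreen", "darkred"))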

txt <- usa.success %>% select(country, blurb)
txt.2 <- usa.failed %>% select(country, blurb)

txt$Blurb1 <- iconv(txt$blurb, "latin1", "UTF-8", sub='')
txt.2$Blurb1 <- iconv(txt.2$blurb, "latin1", "UTF-8", sub='')

blurb.1.corp <- Corpus(VectorSource(txt$Blurb1)) # corpus of successful-project blurbs
blurb.2.corp <- Corpus(VectorSource(txt.2$Blurb1)) # corpus of failed-project blurbs

(corpus <- blurb.1.corp)
<<SimpleCorpus>>
Metadata:  corpus specific: 1, document level (indexed): 0
Content:  documents: 53
(corpus.2 <- blurb.2.corp)
<<SimpleCorpus>>
Metadata:  corpus specific: 1, document level (indexed): 0
Content:  documents: 35
corpus <- tm_map(corpus, stripWhitespace) # removing whitespace
corpus <- tm_map(corpus, removePunctuation) # removing punctuation
corpus <- tm_map(corpus, content_transformer(tolower)) # making lowercase for consistency
corpus <- tm_map(corpus, removeNumbers) # removing numbers
corpus <- tm_map(corpus, removeWords, stopwords("english")) # removing English stopwords like transition words and articles
cleaned_corpus <- tm_map(corpus, stemDocument) # truncating words for broader meanings


corpus.2 <- tm_map(corpus.2, stripWhitespace) # removing whitespace
corpus.2 <- tm_map(corpus.2, removePunctuation) # removing punctuation
corpus.2 <- tm_map(corpus.2, content_transformer(tolower)) # making lowercase for consistency
corpus.2 <- tm_map(corpus.2, removeNumbers) # removing numbers
corpus.2 <- tm_map(corpus.2, removeWords, stopwords("english")) # removing English stopwords like transition words and articles
cleaned_corpus.2 <- tm_map(corpus.2, stemDocument) # truncating words for broader meanings


# DTMs for the successful and failed blurbs
dtm <- tm::DocumentTermMatrix(corpus)
dtm.2 <- tm::DocumentTermMatrix(corpus.2)

Below is a DocumentTermMatrix: one row for each document in the corpus, one column for each word (stem), and each cell holding the count of how many times that word (stem) appears in that document:

# success
dtm.list <- is.list(dtm)
dtm.mat <- as.matrix(dtm)  # dense form w/ plain matrices
library(Matrix)  # sparse form
dtm.Mat <- sparseMatrix(dtm$i, dtm$j, x = dtm$v,
                        dims = c(dtm$nrow, dtm$ncol),
                        dimnames = dtm$dimnames)
head(dtm.Mat[ , 1:7])  # just getting a look at some word frequencies. It reminds me of an SNA matrix
6 x 7 sparse Matrix of class "dgCMatrix"
    Terms
Docs battles building crafting epic game sandbox strategy
   1       1        1        1    1    1       1        1
   2       .        .        .    .    .       .        .
   3       .        .        .    .    .       .        .
   4       .        .        .    .    .       .        .
   5       .        .        .    .    1       .        .
   6       .        .        .    .    .       .        .
# failure
dtm.list.2 <- is.list(dtm.2)
dtm.mat.2 <- as.matrix(dtm.2)  # dense form w/ plain matrices
library(Matrix)  # sparse form
dtm.Mat.2 <- sparseMatrix(dtm.2$i, dtm.2$j, x = dtm.2$v,
                          dims = c(dtm.2$nrow, dtm.2$ncol),
                          dimnames = dtm.2$dimnames)
head(dtm.Mat.2[ , 1:7])  # just getting a look at some word frequencies. It reminds me of an SNA matrix
6 x 7 sparse Matrix of class "dgCMatrix"
    Terms
Docs concept entered film louisiana prize producing proof
   1       1       1    2         1     1         1     1
   2       .       .    .         .     .         .     .
   3       .       .    .         .     .         .     .
   4       .       .    .         .     .         .     .
   5       .       .    .         .     .         .     .
   6       .       .    .         .     .         .     .

This word cloud includes words from the most-backed successful blurbs in the USA. I organized the data so that backer counts descend from the highest number to the lowest number. Some of the top words that show up in my cloud are optimistic words like "better" and "openworld" and "access." This makes sense coming from the most backed and successful project blurbs. Next, we'll take a look at failed blurbs.

set.seed(2121)
wordcloud(colnames(dtm.Mat), colSums(dtm.Mat), max.words = 1000, colors = brewer.pal(6, "Dark2"), scale = c(2, .01)) # frequencies summed over all successful blurbs

This word cloud is about failed Kickstarter projects. Even though these are the failed projects, the words still seem pretty optimistic. Some words are more negative, like danger and backtrack, but there are still a lot of positive words like promoting and guardian and motivational.

# set.seed(2121)
wordcloud(colnames(dtm.Mat.2), colSums(dtm.Mat.2), max.words = 400, colors = brewer.pal(6, "Dark2"), scale = c(2, .01)) # frequencies summed over all failed blurbs

b) Success in words

Provide a pyramid plot to show how the words between successful and unsuccessful projects differ in frequency. A selection of 10 - 20 top words is sufficient here.

library(plotrix)
# usa.success

# success <- findFreqTerms(dtm, lowfreq = 20)  # finding words w/ frequency >= 20
(success <- findFreqTerms(dtm, lowfreq = 3))  # finding words w/ frequency >= 3
 [1] "building"   "epic"       "game"       "sandbox"    "tiny"      
 [6] "without"    "new"        "adventure"  "fantasy"    "world"     
[11] "phone"      "worlds"     "stylus"     "iphone"     "old"       
[16] "rpg"        "lets"       "watch"      "simple"     "miniatures"
[21] "alien"      "beautiful"  "edition"    "one"        "set"       
[26] "board"      "designed"   "part"       "asteroids"  "bluetooth" 
(failed <- findFreqTerms(dtm.2, lowfreq = 3))  # finding words w/ frequency >= 3
 [1] "film"    "bicycle" "kit"     "one"     "two"     "cards"   "deck"   
 [8] "playing" "sleep"   "total"  
x <- success[1:20]
y <- failed[1:20]

# # Data Prep -> see the notes
# top25_df <- top50usa
# # Create the pyramid plot
# p <- plotrix::pyramid.plot(x, y,
#                   labels = top25_df$labels,
#                   # gap = 10,
#                   top.labels = c("Successful", " ", "Failed"),
#                   main = "Words in Common",
#                   laxlab = NULL,
#                   raxlab = NULL,
#                   unit = NULL,
#                   labelcex=0.5)
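The commented-out call above would not work as written, since x and y hold the term names themselves while pyramid.plot() needs numeric left/right frequencies. A minimal sketch, assuming the dtm and dtm.2 objects from above (and the scol/uscol palettes defined earlier), that pairs shared terms by frequency first:

# Hedged sketch: pair term frequencies from the two DTMs, then plot the top shared terms
suc_freq  <- colSums(as.matrix(dtm))
fail_freq <- colSums(as.matrix(dtm.2))
shared    <- intersect(names(suc_freq), names(fail_freq))
top       <- shared[order(suc_freq[shared] + fail_freq[shared], decreasing = TRUE)][1:min(20, length(shared))]
plotrix::pyramid.plot(suc_freq[top], fail_freq[top], labels = top,
                      top.labels = c("Successful", "", "Failed"),
                      main = "Words in Common", gap = 5,  # widen the gap if labels overflow
                      lxcol = scol, rxcol = uscol, unit = NULL)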

c) Simplicity as a virtue

These blurbs are short in length (max. 150 characters), but let's see whether brevity and simplicity still matter. Calculate a readability measure (Flesch Reading Ease, Flesch-Kincaid, or any other comparable measure) for the texts. Visualize the relationship between the readability measure and one of the measures of success. Briefly comment on your finding.
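The Flesch-Kincaid readability plots for this part already appear above (right after the pyramid plot). As a sketch of an alternative measure, the plain Flesch Reading Ease score can be computed the same way, assuming quanteda is still loaded and txt$blurb still holds the successful blurbs:

# Hedged sketch: Flesch Reading Ease as an alternative readability measure
fre <- quanteda::textstat_readability(as.character(txt$blurb), measure = "Flesch")
head(fre)  # higher Flesch Reading Ease = easier to read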

3. Sentiment

Now, let’s check whether the use of positive / negative words or specific emotions helps a project to be successful.

head(words.1 <- data_frame(file = c(usa.success$country)) %>%
       mutate(text = list(usa.success %>% select(text = blurb)))  %>%
       unnest() %>%
       #  group_by(file = str_sub(basename(file), 1, -5)) %>%
       mutate(line_number = row_number()) %>%
       ungroup()) #
# A tibble: 6 x 3
   file text                                                    line_number
  <int> <fct>                                                         <int>
1    21 A sandbox strategy game with town building, crafting, …           1
2    21 The NeoLucida is a 19th-century optical drawing tool u…           2
3    21 High quality truly wireless earbuds, without the high …           3
4    21 Welcome 2 making magic. The time has finally come: De …           4
5    21 A party game in which it is TOTALLY POSSIBLE to drop a…           5
6    21 Augie and the Green Knight is an adventure story about…           6
words.1$text <- as.character(words.1$text)

head(words.1 <-
       words.1 %>%
       unnest_tokens(word, text)  %>%
       anti_join(stop_words) %>%
       mutate(word = wordStem(word)))
# A tibble: 6 x 3
   file line_number word    
  <int>       <int> <chr>   
1    21           1 sandbox 
2    21           1 strategi
3    21           1 game    
4    21           1 town    
5    21           1 build   
6    21           1 craft   
head(sentiment.1 <- inner_join(words.1, get_sentiments("bing")) %>%
       count(file, index = round(line_number / max(line_number) * 100 / 5) * 5, sentiment) %>%
       spread(sentiment, n, fill = 0) %>%
       mutate(net_sentiment = positive - negative))
# A tibble: 6 x 5
   file index negative positive net_sentiment
  <int> <dbl>    <dbl>    <dbl>         <dbl>
1    21     0       25       39            14
2    21     5       55       78            23
3    21    10       51       86            35
4    21    15       54       75            21
5    21    20       55       79            24
6    21    25       50       83            33
sentiment.1 %>% ggplot(aes(x = index, y = net_sentiment, fill = file)) +
  geom_bar(stat = "identity", show.legend = FALSE) +
  facet_wrap(~ file) +
  scale_x_continuous("Location in paper (percent)") +
  scale_y_continuous("Bing Net Sentiment")

head(pw <- words.1[, c("file", "word")]) # subset the relevant columns before counting each word per file
# A tibble: 6 x 2
   file word    
  <int> <chr>   
1    21 sandbox 
2    21 strategi
3    21 game    
4    21 town    
5    21 build   
6    21 craft   
head(d <-  count_(pw, c("file", "word"))) # count each word per file (note: this reuses the name d)
# A tibble: 6 x 3
   file word      n
  <int> <chr> <int>
1    21 1        53
2    21 100      53
3    21 15th     53
4    21 19th     53
5    21 2        53
6    21 21st     53
(pwdtm <- dtm) # this is my document term matrix from above
<<DocumentTermMatrix (documents: 53, terms: 442)>>
Non-/sparse entries: 568/22858
Sparsity           : 98%
Maximal term length: 17
Weighting          : term frequency (tf)
mpwdtm = as.matrix(pwdtm)
df.mpwdtm = as.data.frame(mpwdtm)
t.t <- t(mpwdtm) # transposed: terms as rows and blurbs as columns, so I can see whether documents share the same words
t.t.t <- data.frame(t(t.t)) # transpose back: words as the variables and blurbs as the observations (one row per project blurb)
# head(t.t[ , 1:20]) %>% DT::datatable()  # tells how many of each term is in each doc
head(cor(t.t)) %>% DT::datatable() # we can see how similar documents are to one another, i.e. how strongly pairs of blurbs overlap in word use
dim(t.t) # 442 rows (word stems) and 53 columns (blurbs)
[1] 442  53
t.df <- data.frame(t.t) # my fav form for analysis

words.sentiment.1 <- data.frame(sentiment.1)
words.sentiment.1$id <- sentiment.1$file
txt.sentiment.1 <- merge(sentiment.1, usa.success)
# txt.sentiment.1$staff_pick %>% data.frame() %>% DT::datatable() 
summary(lm(positive ~ staff_pick, txt.sentiment.1))

Call:
lm(formula = positive ~ staff_pick, data = txt.sentiment.1)

Residuals:
    Min      1Q  Median      3Q     Max 
-38.714   1.286   3.286   5.286  10.286 

Coefficients:
                Estimate Std. Error t value Pr(>|t|)    
(Intercept)    7.571e+01  5.456e-01   138.8   <2e-16 ***
staff_picktrue 1.013e-14  7.507e-01     0.0        1    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 12.5 on 1111 degrees of freedom
Multiple R-squared:  9.715e-29, Adjusted R-squared:  -0.0009001 
F-statistic: 1.079e-25 on 1 and 1111 DF,  p-value: 1
ggplot(txt.sentiment.1, aes(staff_pick, positive)) + geom_point() + geom_jitter()

c) Get in their mind

Now, use the NRC Word-Emotion Association Lexicon in the tidytext package to identify a larger set of emotions (anger, anticipation, disgust, fear, joy, sadness, surprise, trust). Again, visualize the relationship between the use of words from these categories and success. What is your finding?
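A minimal sketch of how this could look, assuming dat.suc.fail from above (with columns ID, state, and blurb) and the NRC lexicon available via tidytext::get_sentiments("nrc"):

# Hedged sketch: share of NRC emotion words in successful vs. unsuccessful blurbs
nrc <- tidytext::get_sentiments("nrc") %>%
  dplyr::filter(!sentiment %in% c("positive", "negative"))  # keep the eight emotions only

emotions <- dat.suc.fail %>%
  dplyr::select(ID, state, blurb) %>%
  dplyr::mutate(blurb = as.character(blurb)) %>%
  tidytext::unnest_tokens(word, blurb) %>%
  dplyr::inner_join(nrc, by = "word") %>%
  dplyr::count(state, sentiment) %>%
  dplyr::group_by(state) %>%
  dplyr::mutate(share = n / sum(n)) %>%
  dplyr::ungroup()

ggplot(emotions, aes(x = sentiment, y = share, fill = state)) +
  geom_col(position = "dodge") +
  xlab("NRC emotion") + ylab("Share of emotion words") + theme_minimal()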