To assess how the topics covered by two authors publishing in Fast Company magazine (Evie Nagy and Morgan Clendaniel) differ, 12 articles were randomly selected from the magazine's website. The results show that Evie Nagy focuses on topics such as product brands, platform building, and work time, while Morgan Clendaniel focuses on topics such as building new houses and robots, with "will" recurring as a high-frequency term.
AIM1: Assess the differences in the high-frequency words used by the two authors.
AIM2: Use several methods to summarize the topics each author focuses on.
Six articles per author were randomly selected from Fast Company magazine and stored as text files; each article contains over 1,000 words. All words were cleaned and reduced to their word stems.
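The cleaning steps themselves are not shown in this report; the following is a minimal sketch of the kind of tm pipeline assumed (lower-casing, removing punctuation, numbers, and English stop words, then Porter stemming). The source directory name docs1/ is hypothetical.
library(tm)
library(SnowballC)
# Hypothetical input directory of raw article text files
raw_docs1 <- VCorpus(DirSource("docs1/", encoding="UTF-8"))
translate_docs1 <- tm_map(raw_docs1, content_transformer(tolower))
translate_docs1 <- tm_map(translate_docs1, removePunctuation)
translate_docs1 <- tm_map(translate_docs1, removeNumbers)
translate_docs1 <- tm_map(translate_docs1, removeWords, stopwords("english"))
translate_docs1 <- tm_map(translate_docs1, stripWhitespace)
translate_docs1 <- tm_map(translate_docs1, stemDocument)  # Porter stemming via SnowballC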
The term-document matrices are shown below, including the counts of non-sparse versus sparse entries and the overall sparsity. Evie Nagy's matrix contains more terms than Morgan Clendaniel's (1,252 versus 1,064), indicating a larger vocabulary.
Tdm1 <- TermDocumentMatrix(translate_docs1, control=list(wordLengths=c(1,Inf)))
Tdm1
## A term-document matrix (1252 terms, 6 documents)
##
## Non-/sparse entries: 1889/5623
## Sparsity : 75%
## Maximal term length: 19
## Weighting : term frequency (tf)
Tdm2 <- TermDocumentMatrix(translate_docs2, control=list(wordLengths=c(1,Inf)))
Tdm2
## A term-document matrix (1064 terms, 6 documents)
##
## Non-/sparse entries: 1623/4761
## Sparsity : 75%
## Maximal term length: 24
## Weighting : term frequency (tf)
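The reported sparsity is simply the share of zero cells in the matrix; a small helper (not part of the original analysis) reproduces it from the stored non-zero entries:
# Sparsity = share of zero cells; a TermDocumentMatrix stores only
# non-zero entries, held in tdm$v
sparsity <- function(tdm) {
  total <- prod(dim(tdm))
  round(100 * (total - length(tdm$v)) / total)
}
sparsity(Tdm1)  # 75: 5623 of the 1252 * 6 = 7512 cells are zero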
# Spot-check a slice of the term-document matrix for docs1
#idx1 <- which(dimnames(Tdm1)$Terms == "able")
#inspect(Tdm1[idx1+(0:5),1:6])
# Spot-check a slice of the term-document matrix for docs2
#idx2 <- which(dimnames(Tdm2)$Terms == "abdomen")
#inspect(Tdm2[idx2+(0:5),1:3])
Checking the quartiles of word frequency in each author's articles: the quartiles for Morgan Clendaniel are uniformly higher, suggesting that he repeats a smaller set of favored words more heavily than Evie Nagy does.
The frequency tables below list the terms that appear at least 450 times for each author; the bar plots show the terms appearing at least 500 times.
termFrequency1 <- rowSums(as.matrix(Tdm1))
termFrequency_1 <- subset(termFrequency1, termFrequency1>=500)
quantile(termFrequency1)
## 0% 25% 50% 75% 100%
## 24 31 66 132 2653
termFrequency2 <- rowSums(as.matrix(Tdm2))
# After inspecting the bar plot, "will" (frequency 2871) looks like a stop word
# rather than a topic term; it is kept for now, but could be dropped with:
#termFrequency2 <- termFrequency2[termFrequency2 != 2871]
termFrequency_2 <- subset(termFrequency2, termFrequency2>=500)
quantile(termFrequency2)
## 0% 25% 50% 75% 100%
## 23 62 96 186 2871
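To make the claim about author-specific vocabulary concrete, the shared and exclusive term counts can be compared directly (a check not in the original analysis):
terms1 <- dimnames(Tdm1)$Terms
terms2 <- dimnames(Tdm2)$Terms
length(intersect(terms1, terms2))  # terms used by both authors
length(setdiff(terms2, terms1))    # terms used only by Morgan Clendaniel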
findFreqTerms(Tdm1, lowfreq=450)
## [1] "also" "app" "around" "athletes" "audience"
## [6] "biggest" "brand" "build" "cadillac" "can"
## [11] "car" "change" "companies" "consumers" "content"
## [16] "cosmo" "create" "different" "digital" "drive"
## [21] "first" "focused" "get" "going" "good"
## [26] "growth" "hearst" "important" "interested" "issue"
## [31] "just" "lee" "like" "look" "lot"
## [36] "made" "magazine" "make" "months" "needs"
## [41] "new" "one" "part" "people" "products"
## [46] "says" "site" "system" "take" "talk"
## [51] "team" "thats" "things" "think" "time"
## [56] "us" "user" "want" "way" "will"
## [61] "women" "work" "year" "young"
findFreqTerms(Tdm2, lowfreq=450)
## [1] "america" "around" "best"
## [4] "building" "can" "change"
## [7] "cities" "come" "companies"
## [10] "day" "design" "environment"
## [13] "everyone" "get" "google"
## [16] "happening" "happiest" "health"
## [19] "house" "ideas" "imagine"
## [22] "infographics" "innovative" "just"
## [25] "last" "like" "live"
## [28] "look" "make" "maps"
## [31] "market" "microapartments" "new"
## [34] "next" "notes" "now"
## [37] "one" "people" "phone"
## [40] "photography" "photos" "plus"
## [43] "portlandia" "project" "public"
## [46] "read" "residents" "robots"
## [49] "says" "see" "show"
## [52] "soon" "stories" "students"
## [55] "take" "thats" "time"
## [58] "top" "towers" "two"
## [61] "us" "use" "want"
## [64] "way" "will" "workplaces"
## [67] "world" "year"
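Both lists contain generic terms such as can, like, and says; intersecting them (a check not in the original code) isolates this shared vocabulary so the author-specific terms stand out:
# High-frequency terms common to both authors
intersect(findFreqTerms(Tdm1, lowfreq=450), findFreqTerms(Tdm2, lowfreq=450))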
df1 <- data.frame(term = names(termFrequency_1), freq = termFrequency_1)
ggplot(df1, aes(x = term, y = freq, fill = freq)) + geom_bar(stat = "identity") +
  xlab("Terms") + ylab("Count") + coord_flip()
df2 <- data.frame(term = names(termFrequency_2), freq = termFrequency_2)
ggplot(df2, aes(x = term, y = freq, fill = freq)) + geom_bar(stat = "identity") +
  xlab("Terms") + ylab("Count") + coord_flip()
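One optional tweak, not in the original code: ordering the bars by frequency makes the flipped plot easier to scan.
ggplot(df1, aes(x = reorder(term, freq), y = freq, fill = freq)) +
  geom_bar(stat = "identity") + xlab("Terms") + ylab("Count") + coord_flip()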
Association analysis for "brand", the most prominent topic word in Evie Nagy's articles: the first five words whose correlation with "brand" is at least 0.9 are listed below.
The same analysis for "robots", the most prominent topic word in Morgan Clendaniel's articles: the first five words whose correlation with "robots" is at least 0.9 are listed.
t1<-findAssocs(Tdm1,'brand', 0.9)
rownames(t1)[1:5]
## [1] "ago" "best" "just" "yearold" "consumers"
t2<-findAssocs(Tdm2,'robots', 0.9)
rownames(t2)[1:5]
## [1] "abdomen" "abstract" "always" "attached" "automated"
Word cloud analysis is used to visualize the words that appear at least 500 times in each author's articles.
The word cloud indicates that Evie Nagy focuses on topics including brand, cadillac, women, and products.
The word cloud indicates that Morgan Clendaniel focuses on topics including cities, robots, and building, with the recurring word will.
set.seed(375)
wordFreq1 <- sort(subset(termFrequency1, termFrequency1>=500), decreasing=TRUE)
wordcloud(words=names(wordFreq1), freq=wordFreq1, min.freq=3, random.order=F,
colors=rainbow(12))
wordFreq2 <- sort(subset(termFrequency2, termFrequency2>=500), decreasing=TRUE)
wordcloud(words=names(wordFreq2), freq=wordFreq2, min.freq=3, random.order=F,
colors=rainbow(12))
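A possible extension, not in the original report: a comparison cloud from the same wordcloud package contrasts the two authors in one plot (the column labels here are ours):
all_terms <- union(names(termFrequency1), names(termFrequency2))
cmp <- cbind(Nagy = termFrequency1[all_terms], Clendaniel = termFrequency2[all_terms])
cmp[is.na(cmp)] <- 0   # terms one author never uses get frequency 0
rownames(cmp) <- all_terms
comparison.cloud(cmp, max.words=60, random.order=FALSE)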
Hierarchical clustering is used to find associations among the high-frequency words in each author's articles and to summarize the topics they form.
For Evie Nagy, the clusters center on brand; on says, user, and will; and on important, time, and work, among others.
For Morgan Clendaniel, the clusters center on robots; on will; and on building new houses in cities (building, cities, house, new), among others.
# Keep terms present in at least 60% of documents (sparsity <= 0.4)
Tdm1_den <- removeSparseTerms(Tdm1, sparse=0.4)
m1 <- as.matrix(Tdm1_den)
# Cluster the scaled term vectors with Ward's method
distMatrix1 <- dist(scale(m1))
fit1 <- hclust(distMatrix1, method="ward.D2")
plot(fit1)
rect.hclust(fit1, k=6)
(groups <- cutree(fit1, k=6))
## also around biggest brand build can
## 1 2 1 3 4 1
## comments companies corporate create design development
## 1 1 1 2 1 1
## dont email evie feel first foundation
## 1 1 1 1 4 1
## give going growth help idea important
## 1 2 1 1 1 5
## including issue like listening lot make
## 1 1 4 1 2 4
## money needs new next one part
## 1 4 2 1 4 1
## partner people plans platform products said
## 1 2 4 4 2 5
## says sell shows still take technology
## 6 1 1 1 1 1
## theyre things time two understand user
## 1 4 5 1 1 6
## way will work world year
## 2 6 5 1 4
# Keep terms present in at least 60% of documents (sparsity <= 0.4)
Tdm2_den <- removeSparseTerms(Tdm2, sparse=0.4)
m2 <- as.matrix(Tdm2_den)
# Cluster the scaled term vectors with Ward's method
distMatrix2 <- dist(scale(m2))
fit2 <- hclust(distMatrix2, method="ward.D2")
plot(fit2)
rect.hclust(fit2, k=6)
(groups <- cutree(fit2, k=6))
## around become best bikes building
## 1 1 1 1 2
## can categories change cities collaborative
## 3 1 4 2 1
## come consumption crowdfunding day design
## 1 1 1 1 2
## editors education endoftheyear enough environment
## 1 1 1 1 1
## everyone food future get google
## 4 1 1 4 1
## health house huge infographics innovative
## 4 2 1 1 1
## just last let like live
## 1 1 1 3 2
## look make maps much new
## 3 4 1 1 2
## notes now people pin plus
## 1 3 3 1 1
## project read robots roundups says
## 1 1 5 1 4
## see share soon stories take
## 1 1 1 2 1
## time today top transportation tricorder
## 1 1 1 1 1
## tweet us use want way
## 1 1 4 1 4
## will workplaces world year
## 6 1 3 2
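Any single cluster can be read out of the cutree() result by indexing the groups vector; for example, using the cluster numbers printed above:
names(groups)[groups == 2]  # e.g. the building/cities/house cluster for Clendaniel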
K-means clustering is also applied: each author's documents are clustered, and each cluster is characterized by the three highest-weighted terms in its center. The output below shows the topics found by the k-means method.
set.seed(122)
m1_T <- t(m1)  # transpose: rows are now documents, columns are terms
k <- 5
kmeansResult1 <- kmeans(m1_T, centers=k)
#round(kmeansResult1$centers, digits=3)
# Show the three highest-weighted terms in each cluster center
for (i in 1:k) {
  cat(paste("cluster ", i, ": ", sep=""))
  s <- sort(kmeansResult1$centers[i,], decreasing=T)
  cat(names(s)[1:3], "\n")
}
## cluster 1: time important work
## cluster 2: people way going
## cluster 3: says user will
## cluster 4: platform says year
## cluster 5: brand important products
m2_T <- t(m2)  # transpose: rows are now documents, columns are terms
k <- 5
kmeansResult2 <- kmeans(m2_T, centers=k)
#round(kmeansResult2$centers, digits=3)
# Show the three highest-weighted terms in each cluster center
for (i in 1:k) {
  cat(paste("cluster ", i, ": ", sep=""))
  s <- sort(kmeansResult2$centers[i,], decreasing=T)
  cat(names(s)[1:3], "\n")
}
## cluster 1: building house new
## cluster 2: will year like
## cluster 3: world make people
## cluster 4: cities people look
## cluster 5: robots will us
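With only six documents and five centers, most clusters hold a single article; the cluster sizes and within-cluster sums of squares (not shown in the original) make this explicit:
kmeansResult2$size      # number of documents in each cluster
kmeansResult2$withinss  # within-cluster sum of squares, per cluster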
Several methods were used in this analysis, each with its own pros and cons. Across the analyses above, Evie Nagy and Morgan Clendaniel show clearly different preferences.
Evie Nagy focuses on topics such as product brands, platform building, and work time.
Morgan Clendaniel focuses on topics such as building new houses and robots, with "will" recurring as a high-frequency term.
Limitations include the small sample of articles; other unsupervised clustering methods could also be applied to separate the two authors' topics.
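As one example of such an alternative, a small LDA topic model could be fitted per author. This is only a sketch, assuming the topicmodels package and the corpora already built above:
library(topicmodels)
# LDA expects a document-term matrix rather than a term-document matrix
dtm1 <- DocumentTermMatrix(translate_docs1, control=list(wordLengths=c(1,Inf)))
lda1 <- LDA(dtm1, k=3, control=list(seed=375))
terms(lda1, 5)  # five most probable terms for each of the three topics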