To assess how the topics covered by two authors publishing in Fast Company magazine (Evie Nagy and Morgan Clendaniel) differ, 12 articles were randomly selected from the magazine's website. The results show that Evie Nagy focuses on topics such as product brands, platform building, and work time, while Morgan Clendaniel focuses on topics such as building new houses and robots, with "will" recurring as a high-frequency term.
AIM1: Assess the differences in the high-frequency words used by the two authors.
AIM2: Use several methods to summarize the topics each author focuses on.
Six articles per author were randomly selected from Fast Company magazine and stored as text files; each article contains over 1,000 words. All words were cleaned and reduced to their word stems.
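The cleaning steps themselves are not shown in this report; the following is a minimal sketch of the kind of tm pipeline assumed (lower-casing, removing punctuation, numbers, and English stop words, then Porter stemming). The source directory name docs1/ is hypothetical.
library(tm)
library(SnowballC)
# Hypothetical input directory of raw article text files
raw_docs1 <- VCorpus(DirSource("docs1/", encoding="UTF-8"))
translate_docs1 <- tm_map(raw_docs1, content_transformer(tolower))
translate_docs1 <- tm_map(translate_docs1, removePunctuation)
translate_docs1 <- tm_map(translate_docs1, removeNumbers)
translate_docs1 <- tm_map(translate_docs1, removeWords, stopwords("english"))
translate_docs1 <- tm_map(translate_docs1, stripWhitespace)
translate_docs1 <- tm_map(translate_docs1, stemDocument)  # Porter stemming via SnowballC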
The term-document matrices are shown below, including the counts of non-sparse versus sparse entries and the overall sparsity. Evie Nagy's matrix contains more terms than Morgan Clendaniel's (1,252 versus 1,064), indicating a larger vocabulary.
Tdm1 <- TermDocumentMatrix(translate_docs1, control=list(wordLengths=c(1,Inf)))
Tdm1
## A term-document matrix (1252 terms, 6 documents)
##
## Non-/sparse entries: 1889/5623
## Sparsity : 75%
## Maximal term length: 19
## Weighting : term frequency (tf)
Tdm2 <- TermDocumentMatrix(translate_docs2, control=list(wordLengths=c(1,Inf)))
Tdm2
## A term-document matrix (1064 terms, 6 documents)
##
## Non-/sparse entries: 1623/4761
## Sparsity : 75%
## Maximal term length: 24
## Weighting : term frequency (tf)
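The reported sparsity is simply the share of zero cells in the matrix; a small helper (not part of the original analysis) reproduces it from the stored non-zero entries:
# Sparsity = share of zero cells; a TermDocumentMatrix stores only
# non-zero entries, held in tdm$v
sparsity <- function(tdm) {
  total <- prod(dim(tdm))
  round(100 * (total - length(tdm$v)) / total)
}
sparsity(Tdm1)  # 75: 5623 of the 1252 * 6 = 7512 cells are zero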
# Spot-check a slice of the term-document matrix for docs1
#idx1 <- which(dimnames(Tdm1)$Terms == "able")
#inspect(Tdm1[idx1+(0:5),1:6])
# Spot-check a slice of the term-document matrix for docs2
#idx2 <- which(dimnames(Tdm2)$Terms == "abdomen")
#inspect(Tdm2[idx2+(0:5),1:3])
Checking the quartiles of word frequency in each author's articles: the quartiles for Morgan Clendaniel are uniformly higher, suggesting that he repeats a smaller set of favored words more heavily than Evie Nagy does.
The frequency tables below list the terms that appear at least 450 times for each author; the bar plots show the terms appearing at least 500 times.
termFrequency1 <- rowSums(as.matrix(Tdm1))
termFrequency_1 <- subset(termFrequency1, termFrequency1>=500)
quantile(termFrequency1)
## 0% 25% 50% 75% 100%
## 24 31 66 132 2653
termFrequency2 <- rowSums(as.matrix(Tdm2))
# After inspecting the bar plot, "will" (frequency 2871) looks like a stop word
# rather than a topic term; it is kept for now, but could be dropped with:
#termFrequency2 <- termFrequency2[termFrequency2 != 2871]
termFrequency_2 <- subset(termFrequency2, termFrequency2>=500)
quantile(termFrequency2)
## 0% 25% 50% 75% 100%
## 23 62 96 186 2871
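To make the claim about author-specific vocabulary concrete, the shared and exclusive term counts can be compared directly (a check not in the original analysis):
terms1 <- dimnames(Tdm1)$Terms
terms2 <- dimnames(Tdm2)$Terms
length(intersect(terms1, terms2))  # terms used by both authors
length(setdiff(terms2, terms1))    # terms used only by Morgan Clendaniel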
findFreqTerms(Tdm1, lowfreq=450)
## [1] "also" "app" "around" "athletes" "audience"
## [6] "biggest" "brand" "build" "cadillac" "can"
## [11] "car" "change" "companies" "consumers" "content"
## [16] "cosmo" "create" "different" "digital" "drive"
## [21] "first" "focused" "get" "going" "good"
## [26] "growth" "hearst" "important" "interested" "issue"
## [31] "just" "lee" "like" "look" "lot"
## [36] "made" "magazine" "make" "months" "needs"
## [41] "new" "one" "part" "people" "products"
## [46] "says" "site" "system" "take" "talk"
## [51] "team" "thats" "things" "think" "time"
## [56] "us" "user" "want" "way" "will"
## [61] "women" "work" "year" "young"
findFreqTerms(Tdm2, lowfreq=450)
## [1] "america" "around" "best"
## [4] "building" "can" "change"
## [7] "cities" "come" "companies"
## [10] "day" "design" "environment"
## [13] "everyone" "get" "google"
## [16] "happening" "happiest" "health"
## [19] "house" "ideas" "imagine"
## [22] "infographics" "innovative" "just"
## [25] "last" "like" "live"
## [28] "look" "make" "maps"
## [31] "market" "microapartments" "new"
## [34] "next" "notes" "now"
## [37] "one" "people" "phone"
## [40] "photography" "photos" "plus"
## [43] "portlandia" "project" "public"
## [46] "read" "residents" "robots"
## [49] "says" "see" "show"
## [52] "soon" "stories" "students"
## [55] "take" "thats" "time"
## [58] "top" "towers" "two"
## [61] "us" "use" "want"
## [64] "way" "will" "workplaces"
## [67] "world" "year"
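Both lists contain generic terms such as can, like, and says; intersecting them (a check not in the original code) isolates this shared vocabulary so the author-specific terms stand out:
# High-frequency terms common to both authors
intersect(findFreqTerms(Tdm1, lowfreq=450), findFreqTerms(Tdm2, lowfreq=450))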
df1 <- data.frame(term = names(termFrequency_1), freq = termFrequency_1)
ggplot(df1, aes(x = term, y = freq, fill = freq)) + geom_bar(stat = "identity") +
  xlab("Terms") + ylab("Count") + coord_flip()
df2 <- data.frame(term = names(termFrequency_2), freq = termFrequency_2)
ggplot(df2, aes(x = term, y = freq, fill = freq)) + geom_bar(stat = "identity") +
  xlab("Terms") + ylab("Count") + coord_flip()
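One optional tweak, not in the original code: ordering the bars by frequency makes the flipped plot easier to scan.
ggplot(df1, aes(x = reorder(term, freq), y = freq, fill = freq)) +
  geom_bar(stat = "identity") + xlab("Terms") + ylab("Count") + coord_flip()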
Association analysis for "brand", the most prominent topic word in Evie Nagy's articles: the first five words whose correlation with "brand" is at least 0.9 are listed below.
The same analysis for "robots", the most prominent topic word in Morgan Clendaniel's articles: the first five words whose correlation with "robots" is at least 0.9 are listed.
t1<-findAssocs(Tdm1,'brand', 0.9)
rownames(t1)[1:5]
## [1] "ago" "best" "just" "yearold" "consumers"
t2<-findAssocs(Tdm2,'robots', 0.9)
rownames(t2)[1:5]
## [1] "abdomen" "abstract" "always" "attached" "automated"
Word cloud analysis is used to visualize the words that appear at least 500 times in each author's articles.
The word cloud indicates that Evie Nagy focuses on topics including brand, cadillac, women, and products.
The word cloud indicates that Morgan Clendaniel focuses on topics including cities, robots, and building, with the recurring word will.
set.seed(375)
wordFreq1 <- sort(subset(termFrequency1, termFrequency1>=500), decreasing=TRUE)
wordcloud(words=names(wordFreq1), freq=wordFreq1, min.freq=3, random.order=F,
colors=rainbow(12))
wordFreq2 <- sort(subset(termFrequency2, termFrequency2>=500), decreasing=TRUE)
wordcloud(words=names(wordFreq2), freq=wordFreq2, min.freq=3, random.order=F,
colors=rainbow(12))
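A possible extension, not in the original report: a comparison cloud from the same wordcloud package contrasts the two authors in one plot (the column labels here are ours):
all_terms <- union(names(termFrequency1), names(termFrequency2))
cmp <- cbind(Nagy = termFrequency1[all_terms], Clendaniel = termFrequency2[all_terms])
cmp[is.na(cmp)] <- 0   # terms one author never uses get frequency 0
rownames(cmp) <- all_terms
comparison.cloud(cmp, max.words=60, random.order=FALSE)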
Hierarchical clustering is used to find associations among the high-frequency words in each author's articles and to summarize the topics they form.
For Evie Nagy, the clusters center on brand; on says, user, and will; and on important, time, and work, among others.
For Morgan Clendaniel, the clusters center on robots; on will; and on building new houses in cities (building, cities, house, new), among others.
# Keep terms present in at least 60% of documents (sparsity <= 0.4)
Tdm1_den <- removeSparseTerms(Tdm1, sparse=0.4)
m1 <- as.matrix(Tdm1_den)
# Cluster the scaled term vectors with Ward's method
distMatrix1 <- dist(scale(m1))
fit1 <- hclust(distMatrix1, method="ward.D2")
plot(fit1)
rect.hclust(fit1, k=6)
(groups <- cutree(fit1, k=6))
## also around biggest brand build can
## 1 2 1 3 4 1
## comments companies corporate create design development
## 1 1 1 2 1 1
## dont email evie feel first foundation
## 1 1 1 1 4 1
## give going growth help idea important
## 1 2 1 1 1 5
## including issue like listening lot make
## 1 1 4 1 2 4
## money needs new next one part
## 1 4 2 1 4 1
## partner people plans platform products said
## 1 2 4 4 2 5
## says sell shows still take technology
## 6 1 1 1 1 1
## theyre things time two understand user
## 1 4 5 1 1 6
## way will work world year
## 2 6 5 1 4
# Keep terms present in at least 60% of documents (sparsity <= 0.4)
Tdm2_den <- removeSparseTerms(Tdm2, sparse=0.4)
m2 <- as.matrix(Tdm2_den)
# Cluster the scaled term vectors with Ward's method
distMatrix2 <- dist(scale(m2))
fit2 <- hclust(distMatrix2, method="ward.D2")
plot(fit2)
rect.hclust(fit2, k=6)
(groups <- cutree(fit2, k=6))
## around become best bikes building
## 1 1 1 1 2
## can categories change cities collaborative
## 3 1 4 2 1
## come consumption crowdfunding day design
## 1 1 1 1 2
## editors education endoftheyear enough environment
## 1 1 1 1 1
## everyone food future get google
## 4 1 1 4 1
## health house huge infographics innovative
## 4 2 1 1 1
## just last let like live
## 1 1 1 3 2
## look make maps much new
## 3 4 1 1 2
## notes now people pin plus
## 1 3 3 1 1
## project read robots roundups says
## 1 1 5 1 4
## see share soon stories take
## 1 1 1 2 1
## time today top transportation tricorder
## 1 1 1 1 1
## tweet us use want way
## 1 1 4 1 4
## will workplaces world year
## 6 1 3 2
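Any single cluster can be read out of the cutree() result by indexing the groups vector; for example, using the cluster numbers printed above:
names(groups)[groups == 2]  # e.g. the building/cities/house cluster for Clendaniel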
K-means clustering is also applied: each author's documents are clustered, and each cluster is characterized by the three highest-weighted terms in its center. The output below shows the topics found by the k-means method.
set.seed(122)
m1_T <- t(m1)  # transpose: rows are now documents, columns are terms
k <- 5
kmeansResult1 <- kmeans(m1_T, centers=k)
#round(kmeansResult1$centers, digits=3)
# Show the three highest-weighted terms in each cluster center
for (i in 1:k) {
  cat(paste("cluster ", i, ": ", sep=""))
  s <- sort(kmeansResult1$centers[i,], decreasing=T)
  cat(names(s)[1:3], "\n")
}
## cluster 1: time important work
## cluster 2: people way going
## cluster 3: says user will
## cluster 4: platform says year
## cluster 5: brand important products
m2_T <- t(m2)  # transpose: rows are now documents, columns are terms
k <- 5
kmeansResult2 <- kmeans(m2_T, centers=k)
#round(kmeansResult2$centers, digits=3)
# Show the three highest-weighted terms in each cluster center
for (i in 1:k) {
  cat(paste("cluster ", i, ": ", sep=""))
  s <- sort(kmeansResult2$centers[i,], decreasing=T)
  cat(names(s)[1:3], "\n")
}
## cluster 1: building house new
## cluster 2: will year like
## cluster 3: world make people
## cluster 4: cities people look
## cluster 5: robots will us
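With only six documents and five centers, most clusters hold a single article; the cluster sizes and within-cluster sums of squares (not shown in the original) make this explicit:
kmeansResult2$size      # number of documents in each cluster
kmeansResult2$withinss  # within-cluster sum of squares, per cluster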
Several methods were used in this analysis, each with its own pros and cons. Across the analyses above, Evie Nagy and Morgan Clendaniel show clearly different preferences.
Evie Nagy focuses on topics such as product brands, platform building, and work time.
Morgan Clendaniel focuses on topics such as building new houses and robots, with "will" recurring as a high-frequency term.
Limitations include the small sample of articles; other unsupervised clustering methods could also be applied to separate the two authors' topics.
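As one example of such an alternative, a small LDA topic model could be fitted per author. This is only a sketch, assuming the topicmodels package and the corpora already built above:
library(topicmodels)
# LDA expects a document-term matrix rather than a term-document matrix
dtm1 <- DocumentTermMatrix(translate_docs1, control=list(wordLengths=c(1,Inf)))
lda1 <- LDA(dtm1, k=3, control=list(seed=375))
terms(lda1, 5)  # five most probable terms for each of the three topics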