Main topic: how to build a word cloud

rm(list=ls(all=T))
Sys.setlocale("LC_ALL","C")
[1] "C"
library(magrittr)
library(tm)
Loading required package: NLP

Attaching package: 'NLP'

The following object is masked from 'package:ggplot2':

    annotate
library(SnowballC)
library(RColorBrewer)
library(wordcloud)
library(slam)



1. Prepare the Data

1.1 Word Frequency Table (Document-Term Matrix)

Download the dataset “tweets.csv”, and load it into a data frame called “tweets” using the read.csv() function, remembering to use stringsAsFactors=FALSE when loading the data.

Next, perform the following pre-processing tasks (like we did in Unit 5), noting that we don’t stem the words in the document or remove sparse terms:

  1. Create a corpus using the Tweet variable
  2. Convert the corpus to lowercase
  3. Remove punctuation from the corpus
  4. Remove all English-language stopwords
  5. Build a document-term matrix out of the corpus
  6. Convert the document-term matrix to a data frame called allTweets
tw = read.csv('data/tweets.csv',stringsAsFactors = F)
corpus = Corpus(VectorSource(tw$Tweet))
corpus = tm_map(corpus, content_transformer(tolower))
transformation drops documents
corpus = tm_map(corpus, removePunctuation)
transformation drops documents
corpus = tm_map(corpus, removeWords, stopwords("english"))
transformation drops documents
# corpus = tm_map(corpus, stemDocument)
dtm = DocumentTermMatrix(corpus)
# dtm = removeSparseTerms(dtm, 0.995); dtm

How many unique words are there across all the documents?

dim(dtm)
[1] 1181 3780
To Stem or Not To Stem

Although we typically stem words during the text preprocessing step, we did not do so here. What is the most compelling rationale for skipping this step when visualizing text data?

  • It will be easier to read and understand the word cloud if it includes full words instead of just the word stems


2. Building a Word Cloud

2.1 The wordcloud package

Install and load the “wordcloud” package, which is needed to build word clouds.

library(wordcloud)
library(slam)

As we can read from ?wordcloud, we will need to provide the function with a vector of words and a vector of word frequencies. Which function can we apply to allTweets to get a vector of the words in our dataset, which we’ll pass as the first argument to wordcloud()?

colnames(dtm) %>% head(30)
 [1] "apple"                   "appstore"                "best"                   
 [4] "care"                    "customer"                "ever"                   
 [7] "far"                     "received"                "say"                    
[10] "service"                 "beautiful"               "fricking"               
[13] "ios"                     "smooth"                  "thanxapple"             
[16] "love"                    "iphone"                  "iphone5s"               
[19] "loving"                  "new"                     "pictwittercomxmhjcu4pcb"
[22] "thank"                   "10min"                   "phone"                  
[25] "amazing"                 "ear"                     "headphones"             
[28] "inear"                   "ive"                     "pods"                   
2.2 Word Frequency in the Corpus

Which function should we apply to allTweets to obtain the frequency of each word across all tweets?

col_sums(dtm) %>% head(12)
    apple  appstore      best      care  customer      ever       far  received 
     1297         7        12        11         7         9         3         1 
      say   service beautiful  fricking 
       11        15         2         1 
2.3 The Most Frequent Word

Use allTweets to build a word cloud. Make sure to check out the help page for wordcloud if you are not sure how to do this.

Because we are plotting a large number of words, you might get warnings that some of the words could not be fit on the page and were therefore not plotted – this is especially likely if you are using a smaller screen. You can address these warnings by plotting the words smaller. From ?wordcloud, we can see that the “scale” parameter controls the sizes of the plotted words. By default, the sizes range from 4 for the most frequent words to 0.5 for the least frequent, as denoted by the parameter “scale=c(4, 0.5)”. We could obtain a much smaller plot with, for instance, parameter “scale=c(2, 0.25)”.

What is the most common word across all the tweets (it will be the largest in the outputted word cloud)? Please type the word exactly how you see it in the word cloud. The most frequent word might not be printed if you got a warning about words being cut off – if this happened, be sure to follow the instructions in the paragraph above.

wordcloud(colnames(dtm), col_sums(dtm), scale=c(3, 0.5))

2.4 Remove the ‘Indexing’ Keyword

In the previous subproblem, we noted that there is one word with a much higher frequency than the other words. Repeat the steps to load and pre-process the corpus, this time removing the most frequent word in addition to all elements of stopwords(“english”) in the call to tm_map with removeWords. For a refresher on how to remove this additional word, see the Twitter text analytics lecture.

Replace allTweets with the document-term matrix of this new corpus – we will use this updated corpus for the remainder of the assignment.

Create a word cloud with the updated corpus.

corpus = Corpus(VectorSource(tw$Tweet))
corpus = tm_map(corpus, content_transformer(tolower))
transformation drops documents
corpus = tm_map(corpus, removePunctuation)
transformation drops documents
corpus = tm_map(corpus, removeWords, 
                c('apple',stopwords("english")) )
transformation drops documents
dtm = DocumentTermMatrix(corpus)
# iphone

What is the most common word in this new corpus (the largest word in the outputted word cloud)?

wordcloud(colnames(dtm), col_sums(dtm), scale=c(3, 0.5))



3. Size and Color

So far, the word clouds we’ve built have not been too visually appealing – they are crowded by having too many words displayed, and they don’t take advantage of color. One important step to building visually appealing visualizations is to experiment with the parameters available, which in this case can be viewed by typing ?wordcloud in your R console. In this problem, you should look through the help page and experiment with different parameters to answer the questions.

Below are four word clouds, each of which uses different parameter settings in the call to the wordcloud() function:

Word Cloud A:

wordcloud(colnames(dtm), col_sums(dtm), scale=c(3, 0.4),
          rot.per=0.5) 

Word Cloud B:

wordcloud(colnames(dtm), col_sums(dtm), scale=c(3, 0.4),
          min.freq=8, random.order=F)     # B

Word Cloud C:

dtm1 = dtm[tw$Avg <= -1,]
wordcloud(colnames(dtm1), col_sums(dtm1), scale=c(3, 0.4),
          colors = brewer.pal(9,"Purples")[6:9] ) # C

Word Cloud D:

wordcloud(colnames(dtm), col_sums(dtm), scale=c(3, 0.4),
  min.freq=8, random.order=F, random.color=T,
  colors = brewer.pal(9,"Purples")[6:9] ) # D

3.1

Which word cloud is based only on the negative tweets (tweets with Avg value -1 or less)?

  • Word Cloud C
3.2

Only one word cloud was created without modifying parameters min.freq or max.words. Which word cloud is this?

  • Word Cloud A
3.3

Which word clouds were created with parameter random.order set to FALSE?

  • Word Cloud B
  • Word Cloud D
3.4

Which word cloud was built with a non-default value for parameter rot.per?

  • Word Cloud A
3.5

In Word Cloud C and Word Cloud D, we provided a color palette ranging from light purple to dark purple as the parameter colors (you will learn how to make such a color palette later in this assignment). For which word cloud was the parameter random.color set to TRUE?

  • Word Cloud D


4. Selecting a Color Palette

The use of a palette of colors can often improve the overall effect of a visualization. We can easily select our own colors when plotting; for instance, we could pass c(“red”, “green”, “blue”) as the colors parameter to wordcloud(). The RColorBrewer package, which is based on the ColorBrewer project (colorbrewer.org), provides pre-selected palettes that can lead to more visually appealing images. Though these palettes are designed specifically for coloring maps, we can also use them in our word clouds and other visualizations.

Begin by installing and loading the “RColorBrewer” package. This package may have already been installed and loaded when you installed and loaded the “wordcloud” package, in which case you don’t need to go through this additional installation step. If you obtain errors (for instance, “Error: lazy-load database ‘P’ is corrupt”) after installing and loading the RColorBrewer package and running some of the commands, try closing and re-opening R.

The function brewer.pal() returns color palettes from the ColorBrewer project when provided with appropriate parameters, and the function display.brewer.all() displays the palettes we can choose from.

4.1

Which color palette would be most appropriate for use in a word cloud for which we want to use color to indicate word frequency?

library(RColorBrewer)
display.brewer.all()

4.2

Which RColorBrewer palette name would be most appropriate to use when preparing an image for a document that must be in grayscale?

display.brewer.pal(7, "Greys")

4.3

In sequential palettes, sometimes there is an undesirably large contrast between the lightest and darkest colors. You can see this effect when plotting a word cloud for allTweets with parameter colors=brewer.pal(9, “Blues”), which returns a sequential blue palette with 9 colors.

Which of the following commands addresses this issue by removing the first 4 elements of the 9-color palette of blue colors? Select all that apply.

brewer.pal(9, "Blues")[c(5,6,7,8,9)] 
[1] "#6BAED6" "#4292C6" "#2171B5" "#08519C" "#08306B"
brewer.pal(9, "Blues")[c(-1,-2,-3,-4)] 
[1] "#6BAED6" "#4292C6" "#2171B5" "#08519C" "#08306B"








---
title: "AS7-3 Word Cloud"
author: "陳怡安, M064112014"
output: html_notebook
---

<br>

**Main topic: how to build a word cloud**

```{r echo=T, message=F, cache=F, warning=F}
rm(list=ls(all=T))
Sys.setlocale("LC_ALL","C")
library(magrittr)
library(tm)
library(SnowballC)
library(RColorBrewer)
library(wordcloud)
library(slam)
```
<br>

- - -

### 1. Prepare the Data

##### 1.1 Word Frequency Table (Document-Term Matrix)
Download the dataset "tweets.csv", and load it into a data frame called "tweets" using the read.csv() function, remembering to use stringsAsFactors=FALSE when loading the data.

Next, perform the following pre-processing tasks (like we did in Unit 5), noting that we don't stem the words in the document or remove sparse terms:

1) Create a corpus using the Tweet variable
2) Convert the corpus to lowercase
3) Remove punctuation from the corpus
4) Remove all English-language stopwords
5) Build a document-term matrix out of the corpus
6) Convert the document-term matrix to a data frame called allTweets

```{r}
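# Load the tweets; keep the text as character vectors rather than factors
tw = read.csv('data/tweets.csv', stringsAsFactors = FALSE)
corpus = Corpus(VectorSource(tw$Tweet))                     # one document per tweet
corpus = tm_map(corpus, content_transformer(tolower))       # lowercase
corpus = tm_map(corpus, removePunctuation)                  # strip punctuation
corpus = tm_map(corpus, removeWords, stopwords("english"))  # drop English stopwords
# corpus = tm_map(corpus, stemDocument)      # stemming is deliberately skipped
dtm = DocumentTermMatrix(corpus)             # rows = tweets, columns = terms
# dtm = removeSparseTerms(dtm, 0.995); dtm   # sparse terms are deliberately kept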
```
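
The chunk above stops at the document-term matrix and works with `dtm` directly, so step 6 of the list is never carried out. A minimal sketch of that last step, assuming three made-up tweets (the `toy` vector is hypothetical and not taken from tweets.csv) so that it runs without the data file:

```r
library(tm)

# Three made-up tweets standing in for tw$Tweet (hypothetical data)
toy = c("Apple makes great phones",
        "I love my new iPhone",
        "Great service, Apple!")

corpus_toy = Corpus(VectorSource(toy))
corpus_toy = tm_map(corpus_toy, content_transformer(tolower))
corpus_toy = tm_map(corpus_toy, removePunctuation)
corpus_toy = tm_map(corpus_toy, removeWords, stopwords("english"))
dtm_toy = DocumentTermMatrix(corpus_toy)

# Step 6: dense data frame, one row per tweet and one column per term
allTweets = as.data.frame(as.matrix(dtm_toy))

colnames(allTweets)   # the words
colSums(allTweets)    # how often each word occurs across all tweets
```

On the full dataset the dense data frame is much larger than the sparse matrix, which is one plausible reason the notebook keeps `dtm` and uses `slam::col_sums()` instead of `colSums()`.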

_How many unique words are there across all the documents?_
```{r}
dim(dtm)
```

##### To Stem or Not To Stem
Although we typically stem words during the text preprocessing step, we did not do so here. _What is the most compelling rationale for skipping this step when visualizing text data?_

+ It will be easier to read and understand the word cloud if it includes full words instead of just the word stems 
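
To make the readability argument concrete, the sketch below runs the stemmer from SnowballC (already loaded above) on a few words that occur in the tweets. The word list is an illustrative choice and not part of the assignment.

```r
library(SnowballC)

# A few terms that occur in the tweet vocabulary
words = c("service", "amazing", "headphones", "received")
stems = wordStem(words, language = "english")

# Side-by-side view: the stems are truncated forms, not readable words
data.frame(word = words, stem = stems)
```

Plotting stems like these in a word cloud would force the reader to guess the original word.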

<br><hr>

### 2. Building a Word Cloud

##### 2.1 The `wordcloud` package
Install and load the "wordcloud" package, which is needed to build word clouds.
```{r}
library(wordcloud)
library(slam)
```
As we can read from ?wordcloud, we will need to provide the function with a vector of words and a vector of word frequencies. _Which function can we apply to allTweets to get a vector of the words in our dataset, which we'll pass as the first argument to `wordcloud()`?_
```{r}
colnames(dtm) %>% head(30) #colnames
```

##### 2.2 Word Frequency in the Corpus
_Which function should we apply to allTweets to obtain the frequency of each word across all tweets?_
```{r}
col_sums(dtm) %>% head(12) #colSums
```

##### 2.3 The Most Frequent Word
Use allTweets to build a word cloud. Make sure to check out the help page for wordcloud if you are not sure how to do this.

Because we are plotting a large number of words, you might get warnings that some of the words could not be fit on the page and were therefore not plotted -- this is especially likely if you are using a smaller screen. You can address these warnings by plotting the words smaller. From `?wordcloud`, we can see that the "scale" parameter controls the sizes of the plotted words. By default, the sizes range from 4 for the most frequent words to 0.5 for the least frequent, as denoted by the parameter "scale=c(4, 0.5)". We could obtain a much smaller plot with, for instance, parameter "scale=c(2, 0.25)".

_What is the most common word across all the tweets_ (it will be the largest in the outputted word cloud)? Please type the word exactly how you see it in the word cloud. The most frequent word might not be printed if you got a warning about words being cut off -- if this happened, be sure to follow the instructions in the paragraph above.
```{r fig.height=6, fig.width=6}
wordcloud(colnames(dtm), col_sums(dtm), scale=c(3, 0.5)) #apple
```
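
The largest word in the cloud can also be confirmed numerically by sorting the frequency vector. A minimal sketch, using a hand-typed named vector that holds a few of the counts reported for this dataset so that it runs without the data file; on the real matrix the equivalent call would be `sort(col_sums(dtm), decreasing = TRUE)`.

```r
# A few term counts copied by hand from the col_sums(dtm) output
freq = c(apple = 1297, appstore = 7, best = 12,
         care = 11, customer = 7, ever = 9)

# Sort from most to least frequent
top = sort(freq, decreasing = TRUE)

names(top)[1]   # the most frequent term
head(top, 3)    # the three most frequent terms with their counts
```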

##### 2.4 Remove the 'Indexing' Keyword
In the previous subproblem, we noted that there is one word with a much higher frequency than the other words. Repeat the steps to load and pre-process the corpus, this time removing the most frequent word in addition to all elements of stopwords("english") in the call to tm_map with removeWords. For a refresher on how to remove this additional word, see the Twitter text analytics lecture.

Replace allTweets with the document-term matrix of this new corpus -- we will use this updated corpus for the remainder of the assignment.

Create a word cloud with the updated corpus. 
```{r}
corpus = Corpus(VectorSource(tw$Tweet))
corpus = tm_map(corpus, content_transformer(tolower))
corpus = tm_map(corpus, removePunctuation)
corpus = tm_map(corpus, removeWords, 
                c('apple',stopwords("english")) )
dtm = DocumentTermMatrix(corpus)
# iphone
```
_What is the most common word in this new corpus (the largest word in the outputted word cloud)?_
```{r fig.height=6, fig.width=6}
wordcloud(colnames(dtm), col_sums(dtm), scale=c(3, 0.5))
```
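
An alternative to rebuilding the corpus, not used in this notebook, is to drop the dominant term's column from the matrix. The sketch below shows the idea on a small hand-made count matrix (the values are hypothetical). The remaining counts should match what `removeWords` produces, but that is an assumption worth checking on the real data.

```r
# Toy term counts standing in for as.matrix(dtm): rows = tweets, columns = terms
m = matrix(c(2, 1, 0,
             1, 1, 1,
             3, 1, 1),
           nrow = 3, byrow = TRUE,
           dimnames = list(NULL, c("apple", "iphone", "new")))

# Drop the dominant term's column instead of re-running the preprocessing
m2 = m[, colnames(m) != "apple", drop = FALSE]

# Most frequent remaining term
names(which.max(colSums(m2)))
```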
<br><hr>

### 3. Size and Color
So far, the word clouds we've built have not been too visually appealing -- they are crowded by having too many words displayed, and they don't take advantage of color. One important step to building visually appealing visualizations is to experiment with the parameters available, which in this case can be viewed by typing ?wordcloud in your R console. In this problem, you should look through the help page and experiment with different parameters to answer the questions.

Below are four word clouds, each of which uses different parameter settings in the call to the wordcloud() function:

**Word Cloud A:**
```{r fig.height=6, fig.width=6}
wordcloud(colnames(dtm), col_sums(dtm), scale=c(3, 0.4),
          rot.per=0.5) 
```

**Word Cloud B:**
```{r fig.height=4, fig.width=4}
wordcloud(colnames(dtm), col_sums(dtm), scale=c(3, 0.4),
          min.freq=8, random.order=F)     # B
```

**Word Cloud C:**
```{r fig.height=4, fig.width=4}
dtm1 = dtm[tw$Avg <= -1,]
wordcloud(colnames(dtm1), col_sums(dtm1), scale=c(3, 0.4),
          colors = brewer.pal(9,"Purples")[6:9] ) # C
```

**Word Cloud D:**
```{r fig.height=4, fig.width=4}
wordcloud(colnames(dtm), col_sums(dtm), scale=c(3, 0.4),
  min.freq=8, random.order=F, random.color=T,
  colors = brewer.pal(9,"Purples")[6:9] ) # D
```

##### 3.1 
_Which word cloud is based only on the negative tweets (tweets with Avg value -1 or less)?_

+ Word Cloud C 
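
Word Cloud C relies on logical row indexing: `dtm[tw$Avg <= -1,]` keeps only the rows whose tweet has an average score of -1 or lower. A minimal base R sketch of the same indexing on a toy matrix (the counts and the `avg` scores are hypothetical):

```r
# Toy counts standing in for the document-term matrix (rows = tweets)
m = matrix(c(1, 0,
             0, 2,
             1, 1),
           nrow = 3, byrow = TRUE,
           dimnames = list(NULL, c("hate", "love")))
avg = c(-1.5, 1, -1)   # hypothetical sentiment scores, one per tweet

# The logical row index keeps only tweets scored -1 or lower
neg = m[avg <= -1, , drop = FALSE]

colSums(neg)   # term frequencies among the negative tweets only
```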

##### 3.2 
_Only one word cloud was created without modifying parameters min.freq or max.words. Which word cloud is this?_

+ Word Cloud A

##### 3.3
_Which word clouds were created with parameter random.order set to FALSE?_

+ Word Cloud B
+ Word Cloud D

##### 3.4
_Which word cloud was built with a non-default value for parameter rot.per?_

+ Word Cloud A

##### 3.5
In Word Cloud C and Word Cloud D, we provided a color palette ranging from light purple to dark purple as the parameter colors (you will learn how to make such a color palette later in this assignment). For _which word cloud was the parameter `random.color` set to `TRUE`?_

+ Word Cloud D

<br><hr>

### 4. Selecting a Color Palette
The use of a palette of colors can often improve the overall effect of a visualization. We can easily select our own colors when plotting; for instance, we could pass c("red", "green", "blue") as the colors parameter to wordcloud(). The RColorBrewer package, which is based on the ColorBrewer project (colorbrewer.org), provides pre-selected palettes that can lead to more visually appealing images. Though these palettes are designed specifically for coloring maps, we can also use them in our word clouds and other visualizations.

Begin by installing and loading the "RColorBrewer" package. This package may have already been installed and loaded when you installed and loaded the "wordcloud" package, in which case you don't need to go through this additional installation step. If you obtain errors (for instance, "Error: lazy-load database 'P' is corrupt") after installing and loading the RColorBrewer package and running some of the commands, try closing and re-opening R.

The function brewer.pal() returns color palettes from the ColorBrewer project when provided with appropriate parameters, and the function display.brewer.all() displays the palettes we can choose from.

##### 4.1 
_Which color palette would be most appropriate for use in a word cloud for which we want to use color to indicate word frequency?_
```{r fig.height=8}
library(RColorBrewer)
display.brewer.all()

# YlOrRd
```
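
Color that encodes frequency calls for a sequential palette, one that runs from light to dark. As a sketch, the sequential palettes can be listed from the `brewer.pal.info` table that ships with RColorBrewer and previewed on their own:

```r
library(RColorBrewer)

# brewer.pal.info has one row per palette; category is "seq", "div" or "qual"
seq_pals = rownames(brewer.pal.info)[brewer.pal.info$category == "seq"]
seq_pals

# Preview only the sequential palettes
display.brewer.all(type = "seq")
```

Both `YlOrRd` and `Greys` appear in this list, whereas diverging and qualitative palettes do not order their colors by intensity.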

##### 4.2
_Which RColorBrewer palette name would be most appropriate to use when preparing an image for a document that must be in grayscale?_
```{r}
display.brewer.pal(7, "Greys")
# Greys
```

##### 4.3 
In sequential palettes, sometimes there is an undesirably large contrast between the lightest and darkest colors. You can see this effect when plotting a word cloud for allTweets with parameter colors=brewer.pal(9, "Blues"), which returns a sequential blue palette with 9 colors.

_Which of the following commands addresses this issue by removing the first 4 elements of the 9-color palette of blue colors? Select all that apply._

```{r}
brewer.pal(9, "Blues")[c(5,6,7,8,9)] 
```

```{r}
brewer.pal(9, "Blues")[c(-1,-2,-3,-4)] 
```
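
Both commands select the same five colors: positive indices name the elements to keep, negative indices name the elements to drop. A short sketch that checks the equivalence, with `tail()` added as a third option that is not part of the original question:

```r
library(RColorBrewer)

pal = brewer.pal(9, "Blues")

# Three equivalent ways to keep only the five darkest shades
keep_pos  = pal[5:9]       # positive indices: keep elements 5 through 9
keep_neg  = pal[-(1:4)]    # negative indices: drop elements 1 through 4
keep_tail = tail(pal, 5)   # the last five elements

identical(keep_pos, keep_neg) && identical(keep_neg, keep_tail)
```

Positive and negative indices cannot be mixed in one subscript, so the parentheses in `-(1:4)` matter.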
<br>

- - -

<br><br><br><br><br>

<style>
.caption {
  color: #777;
  margin-top: 10px;
}
p code {
  white-space: inherit;
}
pre {
  word-break: normal;
  word-wrap: normal;
  line-height: 1;
}
pre code {
  white-space: inherit;
}
p,li {
  font-family: "Trebuchet MS", "微軟正黑體", "Microsoft JhengHei";
}

.r{
  line-height: 1.2;
}

title{
  color: #cc0000;
  font-family: "Trebuchet MS", "微軟正黑體", "Microsoft JhengHei";
}

body{
  font-family: "Trebuchet MS", "微軟正黑體", "Microsoft JhengHei";
}

h1,h2,h3,h4,h5{
  color: #008800;
  font-family: "Trebuchet MS", "微軟正黑體", "Microsoft JhengHei";
}

h3{
  color: #b36b00;
  background: #ffe0b3;
  line-height: 2;
  font-weight: bold;
}

h5{
  color: #006000;
  background: #ffffe0;
  line-height: 2;
  font-weight: bold;
}

em{
  color: #0000c0;
  background: #f0f0f0;
  }
</style>

