Email:
RPubs: https://rpubs.com/datazerotohero/
Github: https://github.com/datazerotohero/


1 Time Series

A time series is a set of quantitative values obtained at successive time points. The intervals between time points (e.g., hours, days, weeks, months, or years) are usually equal.

1.1 Line Graph

The most common time-dependent graph is the time-series line graph. Let’s consider the US Economics time series that come with the ggplot2 package. It contains US monthly economic data collected from January 1967 thru January 2015. Let’s plot the personal savings rate psavert. We can do this with a simple line plot.

library(ggplot2)                                  # data visualization 
data(economics, package = "ggplot2")              # load data
ggplot(economics, aes(x = date, y = psavert)) +   # assign x and y-axis from the dataset
  geom_line(color = "indianred",size=0.6) +       # add the line graph, color, and the size
  theme_minimal()+                                # the theme for a better of view
  labs(title = "Personal Savings Rate",           # the title for the graph
       x = "Date",                                # rename x-axis
       y = "Rate")                                # rename y-axis
Simple line graph

Simple line graph

The scale_x_date function can be used to reformat dates. In the graph below, tick marks appear every 5 years, and dates are presented in MMM-YY format. Additionally, the time series line is given an off-red color and made thicker, a trend line (loess), and titles are added, and the theme is simplified.

library(ggplot2)                                  # data visualization
library(scales)                                   # Scaling and Centering of Matrix-like objects
data(economics, package = "ggplot2")              # load data
ggplot(economics, aes(x = date, y = psavert)) +   # assign x and y-axis from the dataset
  geom_line(color = "indianred", size=0.6) +      # add the line graph, color, and the size
  geom_smooth(formula = y~x, method = 'loess') +  # the relationship graph between x and y
  theme_minimal()+                                # the theme for a better of view
  scale_x_date(date_breaks = '5 years',  
               labels = date_format("%b-%y")) +
  labs(title = "Personal Savings Rate",
       subtitle = "1967 to 2015",
       x = "",
       y = "Rate") 
Simple line graph with modified date axis

Simple line graph with modified date axis

Note: When plotting time series, be sure that the data variable is the class date and not class character. See date values for more details.

Let’s close this section with a multivariate time series (more than one series). We’ll compare closing prices for BBCA and BMRI from Jan 1, 2019 until today.

library(quantmod)                                 # quantitative financial modelling framework
library(dplyr)                                    # data manipulation
library(ggplot2)                                  # data visualization
# get PT Bank Central Asia Tbk (BBCA.JK) closing prices
BBCA <- getSymbols("BBCA.JK",src = "yahoo",
                    return.class = "data.frame", 
                    from="2019-01-01")
# proses the data that we want to analyze
BBCA <- BBCA.JK %>%                                 
  mutate(Date = as.Date(row.names(.))) %>%
  select(Date, BBCA.JK.Close) %>%
  rename(Close = BBCA.JK.Close) %>%
  mutate(Company = "Bank Central Asia Tbk")
# ---------------------------------------------------------------------------------------------------------
# get PT Bank Mandiri (Persero) Tbk (BMRI.JK) closing prices
BMRI <- getSymbols("BMRI.JK", src = "yahoo",
                       return.class = "data.frame", 
                       from="2019-01-01")
# proses the data that we want to analyze 
BMRI <- BMRI.JK %>% 
  mutate(Date = as.Date(row.names(.))) %>%
  select(Date, BMRI.JK.Close) %>%
  rename(Close = BMRI.JK.Close) %>%
  mutate(Company = "Bank Mandiri (Persero)")
# ---------------------------------------------------------------------------------------------------------
# combine data for both companies
mseries <- rbind(BBCA,BMRI)
# ---------------------------------------------------------------------------------------------------------
# Visualize the data
ggplot(mseries, aes(x=Date, y= Close, color=Company)) + 
  geom_line(size=1) +
  scale_x_date(date_breaks = '1 month', 
               labels = scales::date_format("%b-%y")) +
  scale_y_continuous(limits = c(3000, 40000), 
                     breaks = seq(120, 380, 10),
                     labels = scales::dollar) +
  labs(title = "Daily Closing Prices (BBCA & BMRI",
       subtitle = "Jan 2019 until today",
       caption = "source: Yahoo Finance",
       y = "Closing Price") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))+
  scale_color_brewer(palette = "Dark2")

1.2 Area Charts

A simple area chart is basically a line graph, with a fill from the line to the x-axis.

library(ggplot2)                                  # data visualization
data(economics, package = "ggplot2")              # load data
ggplot(economics, aes(x = date, y = psavert)) +
  geom_area(fill="azure3", color="indianred3") +
  theme_minimal() +
  labs(title = "Personal Savings Rate",
       x = "Date",
       y = "Personal Savings Rate")
Area charts

Area charts

A stacked area chart can be used to show differences between groups over time. Consider the uspopage data set from the gcookbook package. We’ll plot the age distribution of the US population between 1900 and 2002.

library(ggplot2)                                  # data visualization
library(gcookbook)                                # health and economic data about countries 
data(uspopage, package = "gcookbook")             # load data
ggplot(uspopage, aes(x = Year,
                     y = Thousands, 
                     fill = AgeGroup)) +
  geom_area() +
  theme_minimal()+
  labs(title = "US Population by age",
       x = "Year",
       y = "Population in Thousands")
Area charts with groups over time 1

Area charts with groups over time 1

It is best to avoid scientific notation in your graphs. How likely is it that the average reader will know that 3e+05 means 300,000,000? It is easy to change the scale in ggplot2. Simply divide the Thousands variable by 1000 and report it as Millions. While we are at it, let’s

  • create black borders to highlight the difference between groups
  • reverse the order the groups to match increasing age
  • improve labeling
  • choose a different color scheme

The levels of the AgeGroup variable can be reversed using the fct_rev function in the forcats package.

library(ggplot2)                                  # data visualization
library(gcookbook)                                # health and economic data about countries 
data(uspopage, package = "gcookbook")             # load data
ggplot(uspopage, aes(x = Year,
                     y = Thousands/1000, 
                     fill = forcats::fct_rev(AgeGroup))) +
  geom_area(color = "black") +
  labs(title = "US Population by age",
       subtitle = "1900 to 2002",
       caption = "source: U.S. Census Bureau, 2003, HS-3",
       x = "Year",
       y = "Population in Millions",
       fill = "Age Group") +
  scale_fill_brewer(palette = "Set2") +
  theme_minimal()
Area charts with groups over time 2

Area charts with groups over time 2

Your challenge: Do the same thing to visualize data population Indonesia using economics dataset.

1.3 Dummbbell Charts

This special chart is useful for displaying change between two-time points for several groups or observations. The geom_dumbbell function from the ggalt package is used.

Using the gapminder data-set let’s plot the change in life expectancy from 1952 to 2007 in Asia. The data-set is in long format. We will need to convert it to a wide format in order to create the dumbbell plot. In this graph, we’ll sort by 1952 life expectancy, and modify the line and point size, color the points, add titles and labels, and simplify the theme.

library(ggalt)                                    # containing additional geoms,scales,etc 
library(dplyr)                                    # data manipulation
library(tidyr)                                    # tidy messy data
library(ggplot2)                                  # data visualization
data(gapminder, package = "gapminder")            # load data
# subset data
plotdata_long <- filter(gapminder,
                        continent == "Asia" &
                        year %in% c(1952, 2007)) %>%
  select(country, year, lifeExp)
# convert data to wide format
plotdata_wide <- spread(plotdata_long, year, lifeExp)
names(plotdata_wide) <- c("country", "y1952", "y2007")
# create dumbbell plot
ggplot(plotdata_wide, 
       aes(y = reorder(country, y1952),
           x = y1952,
           xend = y2007)) +  
  geom_dumbbell(size_x = 2,
                size = 0.8,
                size_xend = 2,
                colour_x = "aquamarine3", 
                colour = "cornflowerblue", 
                colour_xend = "indianred") +
  theme_minimal() + 
  labs(title = "Change of Life Expectancy in Asia",
       subtitle = "1952 to 2007",
       x = "Life Expectancy (years)",
       y = "")
Dummbbell charts

Dummbbell charts

It is easier to discern patterns here. For example Afganistan started with the lowest life expectancy in 1952 and still has the lowest in 2007. Israel started relatively high by has made few gains.

1.4 Slope Graphs

When there are several groups and several time points, a slope graph can be helpful. Let’s plot life expectancy for seven Southeast Asia countries in 1992, 1997, 2002, and 2007. Again we’ll use the gapminder data.

To create a slope graph, we’ll use the newggslopegraph function from the CGPfunctions package. The newggslopegraph function parameters are (in order)

  • Data frame
  • Time variable (which must be a factor)
  • Numeric variable to be plotted, and
  • Grouping variable (creating one line per group).
library(ggalt)                                    # containing additional geoms,scales,etc 
library(dplyr)                                    # data manipulation
library(tidyr)                                    # tidy messy data
library(ggplot2)                                  # data visualization
library(CGPfunctions)                             # miscellaneous functions (slope graph)
data(gapminder, package = "gapminder")            # load data
# Select seven Southeast Asia countries data 
# for 1992, 1997, 2002, and 2007
df <- gapminder %>%
  filter(year %in% c(1992, 1997, 2002, 2007) &
           country %in% c("Indonesia", "Singapore", 
                          "Malaysia", "Vietnam",
                          "Philippines","Thailand",
                          "Myanmar")) %>%
  mutate(year = factor(year),
         lifeExp = round(lifeExp)) 
# create slope graph
newggslopegraph(df, year, lifeExp, country) +
  labs(title="Life Expectancy by Country", 
       subtitle="Southeast Asia", 
       caption="source: gapminder")
Slope graphs

Slope graphs

In the graph above, Singapore has the highest life expectancy across the range of years. Myanmar has the lowest life expectancy until the end of the study.

2 Other Graphs

Graphs in this section can be very useful in your daily life as a data scientist, but this is might be not too easy within the other graphs you have learned so far.

2.1 Bubble Charts

A bubble plot is a scatterplot where a third dimension is added: the value of an additional numeric variable is represented through the size of the dots. You need 3 numerical variables as input: one is represented by the x-axis, one by the y-axis, and one by the dot size.

Here is an example using an abstract of the Gapminder dataset made famous through the Hans Rosling Ted Talk. It provides the average life expectancy, GDP per capita, and population size for more than 100 countries. This dataset is available through the gapminder R package.

library(gapminder)                                 # dataset
library(tidyverse)                                 # data manipulation
library(hrbrthemes)                                # for the `theme_ipsum()` and legend
library(ggplot2)                                   # data visualization
library(gridExtra)                                 # siscellaneous Functions for "Grid" Graphics
library(ggrepel)                                   # extra geoms for ggplot2
library(viridis)                                   # colour scales 
library(extrafont)                                 # font family not found in Windows font database
# font_import()                                    # import font family to your PC
loadfonts(device = "win")
# The dataset is provided in the gapminder library
data <- gapminder %>% 
  filter(year=="2007") %>% 
  dplyr::select(-year)
# Show a bubbleplot
data %>%
mutate(pop=pop/1000000) %>%
arrange(desc(pop)) %>%
mutate(country = factor(country, country)) %>%
ggplot(aes(x=gdpPercap, y=lifeExp, size = pop, color = continent)) +
  geom_point(alpha=0.7) +
  scale_size(range = c(1, 15), name="Population (M)") +
  scale_color_viridis(discrete=TRUE, guide=FALSE) +
  theme_ipsum() +
  theme(legend.position="bottom")
Simple bubble plot

Simple bubble plot

In this chart, the relationship between gdp per capita and life Expectancy is quite obvious: rich countries tend to live longer, with a threshold effect when gdp per capita reaches ~10,000. This relationship could have been detected using a classic scatterplot, but the bubble size allows to the nuance of this result with the third level of information: the country population.

However, it can be frustrating not to know what are the countries in the extreme part of the graphic, or what are the one out of the general trend. As usual, annotating the graphic is a crucial step to make it insightful:

# Prepare data
tmp <- data %>%
mutate(annotation = case_when(
  gdpPercap > 5000 & lifeExp < 60 ~ "yes",
  lifeExp < 30 ~ "yes", 
  gdpPercap > 40000 ~ "yes")) %>%
mutate(pop=pop/1000000) %>%
arrange(desc(pop)) %>%
mutate(country = factor(country, country))
# Plot
ggplot(tmp, aes(x=gdpPercap, y=lifeExp, size = pop, color = continent)) +
  geom_point(alpha=0.7) +
  scale_size(range = c(1.4, 19), name="Population (M)") +
  scale_color_viridis(discrete=TRUE) +
  theme_ipsum() +
  theme(legend.position="none") +
  geom_text_repel(data=tmp %>% filter(annotation=="yes"), aes(label=country), size=4 )
Bubble plot with the extreme part

Bubble plot with the extreme part

2.2 Biplots

A biplot is a specialized graph that attempts to represent the relationship between observations, between variables, and between observations and variables, in a low (usually two) dimensional space. It’s easiest to see how this works with an example. Let’s create a biplot for the mtcars data set, using the fviz_pca function from the factoextra package.

library(factoextra)                                # using the `fviz_pca` function
data(mtcars)                                       # load data
fit <- prcomp(x=mtcars, center=TRUE, scale=TRUE)   # fit a principal components model
fviz_pca(fit, repel = TRUE, labelsize = 3) +       # plot the results
      theme_bw() +
      labs(title = "Biplot of mtcars data")
Basic biplot

Basic biplot

Dim1 and Dim2 are the first two principal components - linear combinations of the original \(p\) variables.

\[\begin{eqnarray} PC1=\beta_{10}+\beta_{11}x_1+\beta_{12}x_2+\beta_{13}x_3+\cdots+\beta_{1p}x_p \\ PC2=\beta_{20}+\beta_{21}x_1+\beta_{22}x_2+\beta_{23}x_3+\cdots+\beta_{2p}x_p \end{eqnarray}\]

The weights of these linear combinations (\(\beta_{ij}s\)) are chosen to maximize the variance accounted for in the original variables. Additionally, the principal components (\(PC\)s) are constrained to be uncorrelated with each other.

In this graph, the first PC accounts for 60% of the variability in the original data. The second PC accounts for 24%. Together, they account for 84% of the variability in the original \(p = 11\) variables. As you can see, both the observations (cars) and variables (car characteristics) are plotted in the same graph.

  • Points represent observations. Smaller distances between points suggest similar values on the original set of variables. For example, the Toyota Corolla and Honda Civic are similar to each other, as are the Chrysler Imperial and Lincoln Continental. However, the Toyota Corolla is very different from the Lincoln Continental.
  • The vectors (arrows) represent variables. The angle between vectors is proportional to the correlation between the variables. Smaller angles indicate stronger correlations. For example, gear and am are positively correlated, gear and qsec are uncorrelated (90-degree angle), and am and wt are negatively correlated (angle greater than 90 degrees).
  • The observations that are are farthest along the direction of a variable’s vector, have the highest values on that variable. For example, the Toyota Corolla and Honda Civic have higher values on mpg. The Toyota Corona has a higher qsec. The Duster 360 has more cylinders. Care must be taken in interpreting biplots. They are only accurate when the percentage of variance accounted for is high. Always check your conclusion with the original data.

See the article by Forrest Young to learn more about interpreting biplots correctly.

2.3 Heatmaps

A heatmap displays a set of data using colored tiles for each variable value within each observation. Let’s create a heat-map for the mtcars data set that comes with base R. The mtcars data set contains information on 32 cars measured on 11 variables.

data(mtcars)
library(superheat)
superheat(mtcars,
          scale = TRUE,
          left.label.text.size=3,
          bottom.label.text.size=3,
          bottom.label.size = .05,
          row.dendrogram = TRUE )
Sorted heatmap

Sorted heatmap

Here we can see that the Toyota Corolla and Fiat 128 have similar characteristics. The Lincoln Continental and Cadillac Fleet-wood also have similar characteristics. The superheat function requires that the data be in a particular format. Specifically

  • the data most be all numeric
  • the row names are used to label the left axis. If the desired labels are in a column variable, * the variable must be converted to row names (more on this below)
  • missing values are allowed

Let’s use a heat-map to display changes in life expectation over time for Asian countries. The data come from the gapminder data set.

Since the data is in a long format, we first have to convert it to a wide format. Then we need to ensure that it is a data frame and convert the variable country into row names. Finally, we’ll sort the data by 2007 life expectancy. While we are at it, let’s change the color scheme.

library(tidyr)                                     # data manaipulation
library(dplyr)                                     # data manaipulation
library(RColorBrewer)                              # color scheme
library(superheat)                                 # headmat library
data(gapminder, package="gapminder")               # load data
asia <- gapminder %>%                              # subset Asian countries
  filter(continent == "Asia") %>%
  select(year, country, lifeExp)
plotdata <- spread(asia, year, lifeExp)            # convert to long to wide format
plotdata <- as.data.frame(plotdata)                # save country as row names
row.names(plotdata) <- plotdata$country
plotdata$country <- NULL
sort.order <- order(plotdata$"2007")               # row order
colors <- rev(brewer.pal(5, "Blues"))              # color 
# create the heat map
superheat(plotdata,
          scale = FALSE,
          left.label.text.size=3,
          bottom.label.text.size=3,
          bottom.label.size = .05,
          heat.pal = colors,
          order.rows = sort.order,
          title = "Life Expectancy in Asia")
 Heatmap for time series

Heatmap for time series

2.4 Correlation Plot Matrix

Correlation plots help you to visualize the pairwise relationships between a set of quantitative variables by displaying their correlations using color or shading. In order to explore the relationships among the quantitative variables, we can calculate the Pearson Product-Moment correlation coefficients.

library(dplyr)                                    # data manipulation
library(ggplot2)                                  # data visualization
library(corrplot)                                 # visualization of a correlation matrix
## corrplot 0.84 loaded
library(RColorBrewer)                             # creates nice looking color palettes 
data(mtcars)                                      # load data 
df <- dplyr::select_if(mtcars,is.numeric)         # select numeric variables
corM <- cor(df, use="complete.obs")               # calulate the correlation matrix
round(corM,2)                                     # adjust the correlation in two decimals
##        mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
## mpg   1.00 -0.85 -0.85 -0.78  0.68 -0.87  0.42  0.66  0.60  0.48 -0.55
## cyl  -0.85  1.00  0.90  0.83 -0.70  0.78 -0.59 -0.81 -0.52 -0.49  0.53
## disp -0.85  0.90  1.00  0.79 -0.71  0.89 -0.43 -0.71 -0.59 -0.56  0.39
## hp   -0.78  0.83  0.79  1.00 -0.45  0.66 -0.71 -0.72 -0.24 -0.13  0.75
## drat  0.68 -0.70 -0.71 -0.45  1.00 -0.71  0.09  0.44  0.71  0.70 -0.09
## wt   -0.87  0.78  0.89  0.66 -0.71  1.00 -0.17 -0.55 -0.69 -0.58  0.43
## qsec  0.42 -0.59 -0.43 -0.71  0.09 -0.17  1.00  0.74 -0.23 -0.21 -0.66
## vs    0.66 -0.81 -0.71 -0.72  0.44 -0.55  0.74  1.00  0.17  0.21 -0.57
## am    0.60 -0.52 -0.59 -0.24  0.71 -0.69 -0.23  0.17  1.00  0.79  0.06
## gear  0.48 -0.49 -0.56 -0.13  0.70 -0.58 -0.21  0.21  0.79  1.00  0.27
## carb -0.55  0.53  0.39  0.75 -0.09  0.43 -0.66 -0.57  0.06  0.27  1.00
# corrplot(corM)                                  # default visualization of correlation matrix
# corrplot(corM, method="circle", type="lower")   # visualization method "circle" type lower
# corrplot(corM, method="pie", type="upper")      # visualization method "pie" type upper
# corrplot(corM, method="number", type="upper",   # visualization method "number" type upper
#          tl.col="indianred2", number.cex=0.8)
# corrplot(corM, method="number", type="upper",   # correlation with `hclust` (clustering)
#          tl.col="indianred3",number.cex=0.8,
#          order="hclust")
corrplot(corM, method="number", type="upper",     # correlation with `hclust` dividing 4 colors
         number.cex=0.8, tl.col="indianred4", 
         order="hclust", 
         col=brewer.pal(n=4, name="RdBu"))
Correlation matrix plots

Correlation matrix plots

For those interested to draw this correlogram with their own data, here is the code of the function I adapted based on the corrplot() function from the {corrplot} package (thanks again to all contributors of this package):

dat <- mtcars[, c(1, 3:7)]
# devtools::install_github("laresbernardo/lares")
library(lares)
corr_cross(dat,       # name of dataset
  max_pvalue = 0.05,  # display only significant correlations (at 5% level)
  top = 10            # display top 10 couples of variables (by correlation coefficient)
)
Top 10 Ranked Correlation matrix

Top 10 Ranked Correlation matrix

Use the corr_var() function if you want to focus on the correlation of one variable against all others, and return the highest ones in a plot:

dat <- mtcars[, c(1, 3:7)]
corr_var(dat,              # name of dataset
  disp,                    # name of variable to focus on
  top = 5                  # display top 5 correlations
) 
Correlation of one variable against all others

Correlation of one variable against all others

2.5 Scatterplot Matrix

A scatter-plot matrix is a collection of scatter-plots organized as a grid. It is similar to a correlation plot but instead of displaying correlations, displays the underlying data.

We can create a scatter-plot matrix using the ggpairs function in the GGally package. In this case, we illustrate a scatter-plot matrix by examining the relationships between Mammal size and sleep characteristics. The data come from the msleep data set that ships with ggplot2. Brain weight and body weight are highly skewed (think mouse and elephant) so we’ll transform them to log brain weight and log body weight before creating the graph.

library(GGally)                                    # using the `ggpairs` function
library(dplyr)                                     # data manipulation
data(msleep, package="ggplot2")                    # load/prepare data
msleep <- na.omit(msleep)                          # remove missing values
df <- msleep %>% 
mutate(log_brainwt = log(brainwt),log_bodywt = log(bodywt)) %>%
select(log_brainwt, log_bodywt, sleep_total, sleep_rem)
# custom function for density plot
my_density <- function(data, mapping, ...){
  ggplot(data = data, mapping = mapping) + 
    geom_density(alpha = 0.5, fill = "indianred3", ...)
}
# custom function for scatterplot
my_scatter <- function(data, mapping, ...){
  ggplot(data = data, mapping = mapping) + 
    geom_point(alpha = 0.5,color = "indianred3") + 
    geom_smooth(method=lm, se=FALSE, ...)
}
# create scatterplot matrix
ggpairs(df, 
        lower=list(continuous = my_scatter), 
        diag = list(continuous = my_density)) +
  labs(title = "Mammal size and sleep characteristics") +
  theme_bw()
Scatterplot matrix

Scatterplot matrix

The code you see above just remains you to be able to write your functions provides a great deal of flexibility. Additionally, since the resulting plot is a ggplot2 graph, additional functions can be added to alter the theme, title, labels, etc. See the help for more details.

2.6 Radar charts

A radar chart (also called a spider or star chart) displays one or more groups or observations on three or more quantitative variables. Radar charts can be created with ggradar function in the ggradar package. Unfortunately, the package is not available on CRAN, so we have to install it from Github.

In the example below, we’ll compare dogs, pigs, and cows in terms of body size, brain size, and sleep characteristics (total sleep time, length of the sleep cycle, and amount of REM sleep). The data come from the Mammal Sleep data set.

# install.packages("devtools")
# devtools::install_github("ricardo-bion/ggradar")
library(ggplot2)                                   # dataset and visualization
library(dplyr)                                     # data manipulation
library(scales)                                    # scaling data      
library(ggradar)                                   # radar visualization
data(msleep, package = "ggplot2")                  # prepare data 
plotdata <- msleep %>%
  filter(name %in% c("Cow", "Dog", "Pig")) %>%
  select(name, sleep_total, sleep_rem, 
         sleep_cycle, brainwt, bodywt) %>%
  rename(group = name) %>%
  mutate_at(vars(-group),
            funs(rescale))
# generate radar chart
ggradar(plotdata, 
        grid.label.size = 4,
        axis.label.size = 4, 
        group.point.size = 5,
        group.line.width = 1.5,
        legend.text.size= 10) +
  labs(title = "Mammals, size, and sleep")
Scatterplot matrix

Scatterplot matrix

We can see from the chart that, relatively speaking, cows have large brain and body weights, long sleep cycles, short total sleep time, and little time in REM sleep. Dogs in comparison, have a small body and brain weights, short sleep cycles, and large total sleep time and time in REM sleep (The obvious conclusion is that I want to be a dog - but with a bigger brain).

2.7 Word clouds

A word cloud (also called a tag cloud), is basically an infographic that indicates the frequency of words in a collection of text (e.g., tweets, a text document, a set of text documents). There is a very nice script produced by STHDA that will generate a word cloud directly from a text file. To demonstrate, we’ll process the “I have a dream speech” from “Martin Luther King”.

library("tm")                                      # for text mining
library("SnowballC")                               # for text stemming
library("wordcloud")                               # word-cloud generator 
library("RColorBrewer")                            # color palettes
text <- readLines("Data/speech.txt")               # Read the text file from your PC
docs <- Corpus(VectorSource(text))                 # Load the data as a corpus
# Text transformation
toSpace <- content_transformer(function(x, 
                pattern) gsub(pattern, " ", x))
docs <- tm_map(docs,toSpace, "/")
docs <- tm_map(docs,toSpace, "@")
docs <- tm_map(docs,toSpace, "\\|")
docs <- tm_map(docs,content_transformer(tolower)) # Convert the text to lower case
docs <- tm_map(docs,removeNumbers)                # Remove numbers
docs <- tm_map(docs,removeWords,                 
               stopwords("english"))              # Remove english common stopwords
docs <- tm_map(docs, removeWords, 
               c("blabla1", "blabla2"))           # specify your stopwords as a character vector
docs <- tm_map(docs, removePunctuation)           # Remove punctuations
docs <- tm_map(docs, stripWhitespace)             # Eliminate extra white spaces
# Build a term-document matrix
dtm <- TermDocumentMatrix(docs)
m <- as.matrix(dtm)
v <- sort(rowSums(m),decreasing=TRUE)
d <- data.frame(word = names(v),freq=v)
# The importance of words can be illustrated as a word cloud as follow:
set.seed(1234)
wordcloud(words = d$word, freq = d$freq, min.freq = 1,
          max.words=200, random.order=FALSE, rot.per=0.35, 
          colors=brewer.pal(8, "Dark2"))
Word cloud

Word cloud