Email: siregarbakti@gmail.com
RPubs: https://rpubs.com/datazerotohero/
Github: https://github.com/datazerotohero/
A time series is a set of quantitative values obtained at successive time points. The intervals between time points (e.g., hours, days, weeks, months, or years) are usually equal.
The most common time-dependent graph is the time-series line graph. Let’s consider the US Economics time series that come with the ggplot2 package. It contains US monthly economic data collected from January 1967 thru January 2015. Let’s plot the personal savings rate psavert. We can do this with a simple line plot.
library(ggplot2) # data visualization
data(economics, package = "ggplot2") # load data
ggplot(economics, aes(x = date, y = psavert)) + # assign x and y-axis from the dataset
geom_line(color = "indianred",size=0.6) + # add the line graph, color, and the size
theme_minimal()+ # the theme for a better of view
labs(title = "Personal Savings Rate", # the title for the graph
x = "Date", # rename x-axis
y = "Rate") # rename y-axisSimple line graph
The scale_x_date function can be used to reformat dates. In the graph below, tick marks appear every 5 years, and dates are presented in MMM-YY format. Additionally, the time series line is given an off-red color and made thicker, a trend line (loess), and titles are added, and the theme is simplified.
library(ggplot2) # data visualization
library(scales) # Scaling and Centering of Matrix-like objects
data(economics, package = "ggplot2") # load data
ggplot(economics, aes(x = date, y = psavert)) + # assign x and y-axis from the dataset
geom_line(color = "indianred", size=0.6) + # add the line graph, color, and the size
geom_smooth(formula = y~x, method = 'loess') + # the relationship graph between x and y
theme_minimal()+ # the theme for a better of view
scale_x_date(date_breaks = '5 years',
labels = date_format("%b-%y")) +
labs(title = "Personal Savings Rate",
subtitle = "1967 to 2015",
x = "",
y = "Rate") Simple line graph with modified date axis
Note: When plotting time series, be sure that the data variable is the class date and not class character. See date values for more details.
Let’s close this section with a multivariate time series (more than one series). We’ll compare closing prices for BBCA and BMRI from Jan 1, 2019 until today.
library(quantmod) # quantitative financial modelling framework
library(dplyr) # data manipulation
library(ggplot2) # data visualization
# get PT Bank Central Asia Tbk (BBCA.JK) closing prices
BBCA <- getSymbols("BBCA.JK",src = "yahoo",
return.class = "data.frame",
from="2019-01-01")
# proses the data that we want to analyze
BBCA <- BBCA.JK %>%
mutate(Date = as.Date(row.names(.))) %>%
select(Date, BBCA.JK.Close) %>%
rename(Close = BBCA.JK.Close) %>%
mutate(Company = "Bank Central Asia Tbk")
# ---------------------------------------------------------------------------------------------------------
# get PT Bank Mandiri (Persero) Tbk (BMRI.JK) closing prices
BMRI <- getSymbols("BMRI.JK", src = "yahoo",
return.class = "data.frame",
from="2019-01-01")
# proses the data that we want to analyze
BMRI <- BMRI.JK %>%
mutate(Date = as.Date(row.names(.))) %>%
select(Date, BMRI.JK.Close) %>%
rename(Close = BMRI.JK.Close) %>%
mutate(Company = "Bank Mandiri (Persero)")
# ---------------------------------------------------------------------------------------------------------
# combine data for both companies
mseries <- rbind(BBCA,BMRI)
# ---------------------------------------------------------------------------------------------------------
# Visualize the data
ggplot(mseries, aes(x=Date, y= Close, color=Company)) +
geom_line(size=1) +
scale_x_date(date_breaks = '1 month',
labels = scales::date_format("%b-%y")) +
scale_y_continuous(limits = c(3000, 40000),
breaks = seq(120, 380, 10),
labels = scales::dollar) +
labs(title = "Daily Closing Prices (BBCA & BMRI",
subtitle = "Jan 2019 until today",
caption = "source: Yahoo Finance",
y = "Closing Price") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))+
scale_color_brewer(palette = "Dark2")A simple area chart is basically a line graph, with a fill from the line to the x-axis.
library(ggplot2) # data visualization
data(economics, package = "ggplot2") # load data
ggplot(economics, aes(x = date, y = psavert)) +
geom_area(fill="azure3", color="indianred3") +
theme_minimal() +
labs(title = "Personal Savings Rate",
x = "Date",
y = "Personal Savings Rate")Area charts
A stacked area chart can be used to show differences between groups over time. Consider the uspopage data set from the gcookbook package. We’ll plot the age distribution of the US population between 1900 and 2002.
library(ggplot2) # data visualization
library(gcookbook) # health and economic data about countries
data(uspopage, package = "gcookbook") # load data
ggplot(uspopage, aes(x = Year,
y = Thousands,
fill = AgeGroup)) +
geom_area() +
theme_minimal()+
labs(title = "US Population by age",
x = "Year",
y = "Population in Thousands")Area charts with groups over time 1
It is best to avoid scientific notation in your graphs. How likely is it that the average reader will know that 3e+05 means 300,000,000? It is easy to change the scale in ggplot2. Simply divide the Thousands variable by 1000 and report it as Millions. While we are at it, let’s
The levels of the AgeGroup variable can be reversed using the fct_rev function in the forcats package.
library(ggplot2) # data visualization
library(gcookbook) # health and economic data about countries
data(uspopage, package = "gcookbook") # load data
ggplot(uspopage, aes(x = Year,
y = Thousands/1000,
fill = forcats::fct_rev(AgeGroup))) +
geom_area(color = "black") +
labs(title = "US Population by age",
subtitle = "1900 to 2002",
caption = "source: U.S. Census Bureau, 2003, HS-3",
x = "Year",
y = "Population in Millions",
fill = "Age Group") +
scale_fill_brewer(palette = "Set2") +
theme_minimal()Area charts with groups over time 2
Your challenge: Do the same thing to visualize data population Indonesia using economics dataset.
This special chart is useful for displaying change between two-time points for several groups or observations. The geom_dumbbell function from the ggalt package is used.
Using the gapminder data-set let’s plot the change in life expectancy from 1952 to 2007 in Asia. The data-set is in long format. We will need to convert it to a wide format in order to create the dumbbell plot. In this graph, we’ll sort by 1952 life expectancy, and modify the line and point size, color the points, add titles and labels, and simplify the theme.
library(ggalt) # containing additional geoms,scales,etc
library(dplyr) # data manipulation
library(tidyr) # tidy messy data
library(ggplot2) # data visualization
data(gapminder, package = "gapminder") # load data
# subset data
plotdata_long <- filter(gapminder,
continent == "Asia" &
year %in% c(1952, 2007)) %>%
select(country, year, lifeExp)
# convert data to wide format
plotdata_wide <- spread(plotdata_long, year, lifeExp)
names(plotdata_wide) <- c("country", "y1952", "y2007")
# create dumbbell plot
ggplot(plotdata_wide,
aes(y = reorder(country, y1952),
x = y1952,
xend = y2007)) +
geom_dumbbell(size_x = 2,
size = 0.8,
size_xend = 2,
colour_x = "aquamarine3",
colour = "cornflowerblue",
colour_xend = "indianred") +
theme_minimal() +
labs(title = "Change of Life Expectancy in Asia",
subtitle = "1952 to 2007",
x = "Life Expectancy (years)",
y = "")Dummbbell charts
It is easier to discern patterns here. For example Afganistan started with the lowest life expectancy in 1952 and still has the lowest in 2007. Israel started relatively high by has made few gains.
When there are several groups and several time points, a slope graph can be helpful. Let’s plot life expectancy for seven Southeast Asia countries in 1992, 1997, 2002, and 2007. Again we’ll use the gapminder data.
To create a slope graph, we’ll use the newggslopegraph function from the CGPfunctions package. The newggslopegraph function parameters are (in order)
library(ggalt) # containing additional geoms,scales,etc
library(dplyr) # data manipulation
library(tidyr) # tidy messy data
library(ggplot2) # data visualization
library(CGPfunctions) # miscellaneous functions (slope graph)
data(gapminder, package = "gapminder") # load data
# Select seven Southeast Asia countries data
# for 1992, 1997, 2002, and 2007
df <- gapminder %>%
filter(year %in% c(1992, 1997, 2002, 2007) &
country %in% c("Indonesia", "Singapore",
"Malaysia", "Vietnam",
"Philippines","Thailand",
"Myanmar")) %>%
mutate(year = factor(year),
lifeExp = round(lifeExp))
# create slope graph
newggslopegraph(df, year, lifeExp, country) +
labs(title="Life Expectancy by Country",
subtitle="Southeast Asia",
caption="source: gapminder")Slope graphs
In the graph above, Singapore has the highest life expectancy across the range of years. Myanmar has the lowest life expectancy until the end of the study.
Graphs in this section can be very useful in your daily life as a data scientist, but this is might be not too easy within the other graphs you have learned so far.
A bubble plot is a scatterplot where a third dimension is added: the value of an additional numeric variable is represented through the size of the dots. You need 3 numerical variables as input: one is represented by the x-axis, one by the y-axis, and one by the dot size.
Here is an example using an abstract of the Gapminder dataset made famous through the Hans Rosling Ted Talk. It provides the average life expectancy, GDP per capita, and population size for more than 100 countries. This dataset is available through the gapminder R package.
library(gapminder) # dataset
library(tidyverse) # data manipulation
library(hrbrthemes) # for the `theme_ipsum()` and legend
library(ggplot2) # data visualization
library(gridExtra) # siscellaneous Functions for "Grid" Graphics
library(ggrepel) # extra geoms for ggplot2
library(viridis) # colour scales
library(extrafont) # font family not found in Windows font database
# font_import() # import font family to your PC
loadfonts(device = "win")
# The dataset is provided in the gapminder library
data <- gapminder %>%
filter(year=="2007") %>%
dplyr::select(-year)
# Show a bubbleplot
data %>%
mutate(pop=pop/1000000) %>%
arrange(desc(pop)) %>%
mutate(country = factor(country, country)) %>%
ggplot(aes(x=gdpPercap, y=lifeExp, size = pop, color = continent)) +
geom_point(alpha=0.7) +
scale_size(range = c(1, 15), name="Population (M)") +
scale_color_viridis(discrete=TRUE, guide=FALSE) +
theme_ipsum() +
theme(legend.position="bottom")Simple bubble plot
In this chart, the relationship between gdp per capita and life Expectancy is quite obvious: rich countries tend to live longer, with a threshold effect when gdp per capita reaches ~10,000. This relationship could have been detected using a classic scatterplot, but the bubble size allows to the nuance of this result with the third level of information: the country population.
However, it can be frustrating not to know what are the countries in the extreme part of the graphic, or what are the one out of the general trend. As usual, annotating the graphic is a crucial step to make it insightful:
# Prepare data
tmp <- data %>%
mutate(annotation = case_when(
gdpPercap > 5000 & lifeExp < 60 ~ "yes",
lifeExp < 30 ~ "yes",
gdpPercap > 40000 ~ "yes")) %>%
mutate(pop=pop/1000000) %>%
arrange(desc(pop)) %>%
mutate(country = factor(country, country))
# Plot
ggplot(tmp, aes(x=gdpPercap, y=lifeExp, size = pop, color = continent)) +
geom_point(alpha=0.7) +
scale_size(range = c(1.4, 19), name="Population (M)") +
scale_color_viridis(discrete=TRUE) +
theme_ipsum() +
theme(legend.position="none") +
geom_text_repel(data=tmp %>% filter(annotation=="yes"), aes(label=country), size=4 )Bubble plot with the extreme part
A biplot is a specialized graph that attempts to represent the relationship between observations, between variables, and between observations and variables, in a low (usually two) dimensional space. It’s easiest to see how this works with an example. Let’s create a biplot for the mtcars data set, using the fviz_pca function from the factoextra package.
library(factoextra) # using the `fviz_pca` function
data(mtcars) # load data
fit <- prcomp(x=mtcars, center=TRUE, scale=TRUE) # fit a principal components model
fviz_pca(fit, repel = TRUE, labelsize = 3) + # plot the results
theme_bw() +
labs(title = "Biplot of mtcars data")Basic biplot
Dim1 and Dim2 are the first two principal components - linear combinations of the original \(p\) variables.
\[\begin{eqnarray} PC1=\beta_{10}+\beta_{11}x_1+\beta_{12}x_2+\beta_{13}x_3+\cdots+\beta_{1p}x_p \\ PC2=\beta_{20}+\beta_{21}x_1+\beta_{22}x_2+\beta_{23}x_3+\cdots+\beta_{2p}x_p \end{eqnarray}\]
The weights of these linear combinations (\(\beta_{ij}s\)) are chosen to maximize the variance accounted for in the original variables. Additionally, the principal components (\(PC\)s) are constrained to be uncorrelated with each other.
In this graph, the first PC accounts for 60% of the variability in the original data. The second PC accounts for 24%. Together, they account for 84% of the variability in the original \(p = 11\) variables. As you can see, both the observations (cars) and variables (car characteristics) are plotted in the same graph.
See the article by Forrest Young to learn more about interpreting biplots correctly.
A heatmap displays a set of data using colored tiles for each variable value within each observation. Let’s create a heat-map for the mtcars data set that comes with base R. The mtcars data set contains information on 32 cars measured on 11 variables.
data(mtcars)
library(superheat)
superheat(mtcars,
scale = TRUE,
left.label.text.size=3,
bottom.label.text.size=3,
bottom.label.size = .05,
row.dendrogram = TRUE )Sorted heatmap
Here we can see that the Toyota Corolla and Fiat 128 have similar characteristics. The Lincoln Continental and Cadillac Fleet-wood also have similar characteristics. The superheat function requires that the data be in a particular format. Specifically
Let’s use a heat-map to display changes in life expectation over time for Asian countries. The data come from the gapminder data set.
Since the data is in a long format, we first have to convert it to a wide format. Then we need to ensure that it is a data frame and convert the variable country into row names. Finally, we’ll sort the data by 2007 life expectancy. While we are at it, let’s change the color scheme.
library(tidyr) # data manaipulation
library(dplyr) # data manaipulation
library(RColorBrewer) # color scheme
library(superheat) # headmat library
data(gapminder, package="gapminder") # load data
asia <- gapminder %>% # subset Asian countries
filter(continent == "Asia") %>%
select(year, country, lifeExp)
plotdata <- spread(asia, year, lifeExp) # convert to long to wide format
plotdata <- as.data.frame(plotdata) # save country as row names
row.names(plotdata) <- plotdata$country
plotdata$country <- NULL
sort.order <- order(plotdata$"2007") # row order
colors <- rev(brewer.pal(5, "Blues")) # color
# create the heat map
superheat(plotdata,
scale = FALSE,
left.label.text.size=3,
bottom.label.text.size=3,
bottom.label.size = .05,
heat.pal = colors,
order.rows = sort.order,
title = "Life Expectancy in Asia")Heatmap for time series
Correlation plots help you to visualize the pairwise relationships between a set of quantitative variables by displaying their correlations using color or shading. In order to explore the relationships among the quantitative variables, we can calculate the Pearson Product-Moment correlation coefficients.
library(dplyr) # data manipulation
library(ggplot2) # data visualization
library(corrplot) # visualization of a correlation matrix## corrplot 0.84 loaded
library(RColorBrewer) # creates nice looking color palettes
data(mtcars) # load data
df <- dplyr::select_if(mtcars,is.numeric) # select numeric variables
corM <- cor(df, use="complete.obs") # calulate the correlation matrix
round(corM,2) # adjust the correlation in two decimals## mpg cyl disp hp drat wt qsec vs am gear carb
## mpg 1.00 -0.85 -0.85 -0.78 0.68 -0.87 0.42 0.66 0.60 0.48 -0.55
## cyl -0.85 1.00 0.90 0.83 -0.70 0.78 -0.59 -0.81 -0.52 -0.49 0.53
## disp -0.85 0.90 1.00 0.79 -0.71 0.89 -0.43 -0.71 -0.59 -0.56 0.39
## hp -0.78 0.83 0.79 1.00 -0.45 0.66 -0.71 -0.72 -0.24 -0.13 0.75
## drat 0.68 -0.70 -0.71 -0.45 1.00 -0.71 0.09 0.44 0.71 0.70 -0.09
## wt -0.87 0.78 0.89 0.66 -0.71 1.00 -0.17 -0.55 -0.69 -0.58 0.43
## qsec 0.42 -0.59 -0.43 -0.71 0.09 -0.17 1.00 0.74 -0.23 -0.21 -0.66
## vs 0.66 -0.81 -0.71 -0.72 0.44 -0.55 0.74 1.00 0.17 0.21 -0.57
## am 0.60 -0.52 -0.59 -0.24 0.71 -0.69 -0.23 0.17 1.00 0.79 0.06
## gear 0.48 -0.49 -0.56 -0.13 0.70 -0.58 -0.21 0.21 0.79 1.00 0.27
## carb -0.55 0.53 0.39 0.75 -0.09 0.43 -0.66 -0.57 0.06 0.27 1.00
# corrplot(corM) # default visualization of correlation matrix
# corrplot(corM, method="circle", type="lower") # visualization method "circle" type lower
# corrplot(corM, method="pie", type="upper") # visualization method "pie" type upper
# corrplot(corM, method="number", type="upper", # visualization method "number" type upper
# tl.col="indianred2", number.cex=0.8)
# corrplot(corM, method="number", type="upper", # correlation with `hclust` (clustering)
# tl.col="indianred3",number.cex=0.8,
# order="hclust")
corrplot(corM, method="number", type="upper", # correlation with `hclust` dividing 4 colors
number.cex=0.8, tl.col="indianred4",
order="hclust",
col=brewer.pal(n=4, name="RdBu"))Correlation matrix plots
For those interested to draw this correlogram with their own data, here is the code of the function I adapted based on the corrplot() function from the {corrplot} package (thanks again to all contributors of this package):
dat <- mtcars[, c(1, 3:7)]
# devtools::install_github("laresbernardo/lares")
library(lares)
corr_cross(dat, # name of dataset
max_pvalue = 0.05, # display only significant correlations (at 5% level)
top = 10 # display top 10 couples of variables (by correlation coefficient)
)Top 10 Ranked Correlation matrix
Use the corr_var() function if you want to focus on the correlation of one variable against all others, and return the highest ones in a plot:
dat <- mtcars[, c(1, 3:7)]
corr_var(dat, # name of dataset
disp, # name of variable to focus on
top = 5 # display top 5 correlations
) Correlation of one variable against all others
A scatter-plot matrix is a collection of scatter-plots organized as a grid. It is similar to a correlation plot but instead of displaying correlations, displays the underlying data.
We can create a scatter-plot matrix using the ggpairs function in the GGally package. In this case, we illustrate a scatter-plot matrix by examining the relationships between Mammal size and sleep characteristics. The data come from the msleep data set that ships with ggplot2. Brain weight and body weight are highly skewed (think mouse and elephant) so we’ll transform them to log brain weight and log body weight before creating the graph.
library(GGally) # using the `ggpairs` function
library(dplyr) # data manipulation
data(msleep, package="ggplot2") # load/prepare data
msleep <- na.omit(msleep) # remove missing values
df <- msleep %>%
mutate(log_brainwt = log(brainwt),log_bodywt = log(bodywt)) %>%
select(log_brainwt, log_bodywt, sleep_total, sleep_rem)
# custom function for density plot
my_density <- function(data, mapping, ...){
ggplot(data = data, mapping = mapping) +
geom_density(alpha = 0.5, fill = "indianred3", ...)
}
# custom function for scatterplot
my_scatter <- function(data, mapping, ...){
ggplot(data = data, mapping = mapping) +
geom_point(alpha = 0.5,color = "indianred3") +
geom_smooth(method=lm, se=FALSE, ...)
}
# create scatterplot matrix
ggpairs(df,
lower=list(continuous = my_scatter),
diag = list(continuous = my_density)) +
labs(title = "Mammal size and sleep characteristics") +
theme_bw()Scatterplot matrix
The code you see above just remains you to be able to write your functions provides a great deal of flexibility. Additionally, since the resulting plot is a ggplot2 graph, additional functions can be added to alter the theme, title, labels, etc. See the help for more details.
A radar chart (also called a spider or star chart) displays one or more groups or observations on three or more quantitative variables. Radar charts can be created with ggradar function in the ggradar package. Unfortunately, the package is not available on CRAN, so we have to install it from Github.
In the example below, we’ll compare dogs, pigs, and cows in terms of body size, brain size, and sleep characteristics (total sleep time, length of the sleep cycle, and amount of REM sleep). The data come from the Mammal Sleep data set.
# install.packages("devtools")
# devtools::install_github("ricardo-bion/ggradar")
library(ggplot2) # dataset and visualization
library(dplyr) # data manipulation
library(scales) # scaling data
library(ggradar) # radar visualization
data(msleep, package = "ggplot2") # prepare data
plotdata <- msleep %>%
filter(name %in% c("Cow", "Dog", "Pig")) %>%
select(name, sleep_total, sleep_rem,
sleep_cycle, brainwt, bodywt) %>%
rename(group = name) %>%
mutate_at(vars(-group),
funs(rescale))
# generate radar chart
ggradar(plotdata,
grid.label.size = 4,
axis.label.size = 4,
group.point.size = 5,
group.line.width = 1.5,
legend.text.size= 10) +
labs(title = "Mammals, size, and sleep")Scatterplot matrix
We can see from the chart that, relatively speaking, cows have large brain and body weights, long sleep cycles, short total sleep time, and little time in REM sleep. Dogs in comparison, have a small body and brain weights, short sleep cycles, and large total sleep time and time in REM sleep (The obvious conclusion is that I want to be a dog - but with a bigger brain).
A word cloud (also called a tag cloud), is basically an infographic that indicates the frequency of words in a collection of text (e.g., tweets, a text document, a set of text documents). There is a very nice script produced by STHDA that will generate a word cloud directly from a text file. To demonstrate, we’ll process the “I have a dream speech” from “Martin Luther King”.
library("tm") # for text mining
library("SnowballC") # for text stemming
library("wordcloud") # word-cloud generator
library("RColorBrewer") # color palettes
text <- readLines("Data/speech.txt") # Read the text file from your PC
docs <- Corpus(VectorSource(text)) # Load the data as a corpus
# Text transformation
toSpace <- content_transformer(function(x,
pattern) gsub(pattern, " ", x))
docs <- tm_map(docs,toSpace, "/")
docs <- tm_map(docs,toSpace, "@")
docs <- tm_map(docs,toSpace, "\\|")
docs <- tm_map(docs,content_transformer(tolower)) # Convert the text to lower case
docs <- tm_map(docs,removeNumbers) # Remove numbers
docs <- tm_map(docs,removeWords,
stopwords("english")) # Remove english common stopwords
docs <- tm_map(docs, removeWords,
c("blabla1", "blabla2")) # specify your stopwords as a character vector
docs <- tm_map(docs, removePunctuation) # Remove punctuations
docs <- tm_map(docs, stripWhitespace) # Eliminate extra white spaces
# Build a term-document matrix
dtm <- TermDocumentMatrix(docs)
m <- as.matrix(dtm)
v <- sort(rowSums(m),decreasing=TRUE)
d <- data.frame(word = names(v),freq=v)
# The importance of words can be illustrated as a word cloud as follow:
set.seed(1234)
wordcloud(words = d$word, freq = d$freq, min.freq = 1,
max.words=200, random.order=FALSE, rot.per=0.35,
colors=brewer.pal(8, "Dark2"))Word cloud