These notes provide essential details for learning visual summaries of a data set using the plotly package in R. The approach is organized around the nature of a variable and the plots suited to that kind of variable. We shall also note ways to improve the presentation of a plot by enhancing its components.
The code may be viewed using the tab appearing at the top right of each output.
library(kableExtra)
library(DT)
p1=c("Plot 1","Plot 2","Plot 3","Plot 4")
p2=c("1-D","2-D More","2-D More","2-D More")
p3=c("Numeric","Numeric-Grouped by Factor (Dichot/Polychot)","Numeric and Numeric","Binary (Dichot/Polychot)")
p4=c("Histogram, Box plot, Density, Scatter Plot","Histogram, Box plot, Density, Scatter Plot","Scatter Plot","Bar plot (1-D, 2-D - Factor and Factor - Stack, Side), Pie Chart, Donut")
plda=as.data.frame(cbind(p1,p2,p3,p4))
colnames(plda)=c("Plots","Dimension","Variable Type","Plot Type")
kable(rbind(plda)) %>%
kable_styling(bootstrap_options ='striped')
| Plots | Dimension | Variable Type | Plot Type |
|---|---|---|---|
| Plot 1 | 1-D | Numeric | Histogram, Box plot, Density, Scatter Plot |
| Plot 2 | 2-D More | Numeric-Grouped by Factor (Dichot/Polychot) | Histogram, Box plot, Density, Scatter Plot |
| Plot 3 | 2-D More | Numeric and Numeric | Scatter Plot |
| Plot 4 | 2-D More | Binary (Dichot/Polychot) | Bar plot (1-D, 2-D - Factor and Factor - Stack, Side), Pie Chart, Donut |
We shall make use of the midwest dataset from the ggplot2 package and load the plotly package for visualization.
library(plotly)
library(dplyr)
library(GGally)
data(midwest,package='ggplot2')
Let us look at the nature of the variables in the midwest data
csl=1:length(midwest)
cn=names(midwest)
ct=c(rep("Character",2),
"Factor",
rep("Numeric",23),
rep("Factor",2))
cv=c(rep("-",2),
"Bar/Pie",
rep("Scatter/Histogram/Box/Density",23),
rep("Bar/Pie",2))
cnu=c(rep("-",2),
"Count/Proportion",
rep("Mean/Median/Percentiles/Variance",23),
rep("Count/Proportion",2))
meta=data.frame(
sl.no=csl,Variable=cn,Type=ct,
Visual_summary=cv,Numerical_summary=cnu
)
kbl(meta) %>% kable_material_dark()
| sl.no | Variable | Type | Visual_summary | Numerical_summary |
|---|---|---|---|---|
| 1 | PID | Character | - | - |
| 2 | county | Character | - | - |
| 3 | state | Factor | Bar/Pie | Count/Proportion |
| 4 | area | Numeric | Scatter/Histogram/Box/Density | Mean/Median/Percentiles/Variance |
| 5 | poptotal | Numeric | Scatter/Histogram/Box/Density | Mean/Median/Percentiles/Variance |
| 6 | popdensity | Numeric | Scatter/Histogram/Box/Density | Mean/Median/Percentiles/Variance |
| 7 | popwhite | Numeric | Scatter/Histogram/Box/Density | Mean/Median/Percentiles/Variance |
| 8 | popblack | Numeric | Scatter/Histogram/Box/Density | Mean/Median/Percentiles/Variance |
| 9 | popamerindian | Numeric | Scatter/Histogram/Box/Density | Mean/Median/Percentiles/Variance |
| 10 | popasian | Numeric | Scatter/Histogram/Box/Density | Mean/Median/Percentiles/Variance |
| 11 | popother | Numeric | Scatter/Histogram/Box/Density | Mean/Median/Percentiles/Variance |
| 12 | percwhite | Numeric | Scatter/Histogram/Box/Density | Mean/Median/Percentiles/Variance |
| 13 | percblack | Numeric | Scatter/Histogram/Box/Density | Mean/Median/Percentiles/Variance |
| 14 | percamerindan | Numeric | Scatter/Histogram/Box/Density | Mean/Median/Percentiles/Variance |
| 15 | percasian | Numeric | Scatter/Histogram/Box/Density | Mean/Median/Percentiles/Variance |
| 16 | percother | Numeric | Scatter/Histogram/Box/Density | Mean/Median/Percentiles/Variance |
| 17 | popadults | Numeric | Scatter/Histogram/Box/Density | Mean/Median/Percentiles/Variance |
| 18 | perchsd | Numeric | Scatter/Histogram/Box/Density | Mean/Median/Percentiles/Variance |
| 19 | percollege | Numeric | Scatter/Histogram/Box/Density | Mean/Median/Percentiles/Variance |
| 20 | percprof | Numeric | Scatter/Histogram/Box/Density | Mean/Median/Percentiles/Variance |
| 21 | poppovertyknown | Numeric | Scatter/Histogram/Box/Density | Mean/Median/Percentiles/Variance |
| 22 | percpovertyknown | Numeric | Scatter/Histogram/Box/Density | Mean/Median/Percentiles/Variance |
| 23 | percbelowpoverty | Numeric | Scatter/Histogram/Box/Density | Mean/Median/Percentiles/Variance |
| 24 | percchildbelowpovert | Numeric | Scatter/Histogram/Box/Density | Mean/Median/Percentiles/Variance |
| 25 | percadultpoverty | Numeric | Scatter/Histogram/Box/Density | Mean/Median/Percentiles/Variance |
| 26 | percelderlypoverty | Numeric | Scatter/Histogram/Box/Density | Mean/Median/Percentiles/Variance |
| 27 | inmetro | Factor | Bar/Pie | Count/Proportion |
| 28 | category | Factor | Bar/Pie | Count/Proportion |
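The variable types in the table above were entered by hand. As a rough cross-check, the storage class of each column can be read from the data itself. This is only a sketch: the object names var_class, var_kind and midwest_f are illustrative, and the choice of which columns to convert to factors is an assumption based on the table. Columns stored as integers, such as an identifier or a 0/1 indicator, are reported as numeric by R even when they are treated as labels or factors in the table.

```r
library(dplyr)
data(midwest, package = "ggplot2")

# Storage class of every column, as R sees it
var_class <- vapply(midwest, function(col) class(col)[1], character(1))

# Broad grouping: numeric columns versus everything else
var_kind <- ifelse(vapply(midwest, is.numeric, logical(1)),
                   "Numeric", "Character/Factor")
table(var_kind)

# Assumption: state, inmetro and category are to be treated as factors
midwest_f <- midwest %>%
  mutate(across(c(state, inmetro, category), as.factor))
```

Converting the grouping columns to factors up front makes plotly treat them as discrete categories when they are mapped to color or symbol.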
plotly is an R package for creating interactive web-based graphs via the open source JavaScript graphing library plotly.js; every graph made with the plotly R package is powered by that library. The plot_ly() function provides a ‘direct’ interface to plotly.js with some additional abstractions to help reduce typing. These abstractions, inspired by the Grammar of Graphics and ggplot2, make it much faster to iterate from one graphic to another, making it easier to discover interesting features in the data.
The plotly package takes a purely functional approach to a layered grammar of graphics. The plot_ly() code is first converted to an R list, the list is then serialized to a JSON object, and only then is the chart rendered by plotly.js. We can see the work-flow in the image below
In plotly.js terminology, a figure has two key components: data (aka, traces) and a layout. A trace defines a mapping from data to visuals. Every trace has a type (e.g., histogram, pie, scatter, etc.) and the trace type determines what other attributes (i.e., visual and/or interactive properties, like x, hoverinfo, name) are available to control the trace mapping.
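The two components of a figure can be inspected from R. The following is a minimal sketch, assuming the plotly functions plotly_build() and plotly_json() behave as documented; the variable percollege and the object names p, built and json are chosen only for illustration.

```r
library(plotly)
data(midwest, package = "ggplot2")

# A small figure: one histogram trace plus a layout setting
p <- plot_ly(midwest, x = ~percollege, type = "histogram") %>%
  layout(xaxis = list(title = "Percent college educated"))

# plotly_build() evaluates the plot into the R list that is sent to plotly.js
built <- plotly_build(p)
length(built$x$data)      # number of traces
built$x$data[[1]]$type    # type of the first trace
names(built$x$layout)     # components of the layout

# The same list serialized to JSON (jsonedit = FALSE returns text, not a viewer)
json <- plotly_json(p, jsonedit = FALSE)
```

Looking at built$x is a convenient way to see which attributes a given trace type has actually received.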
We shall illustrate each type of plot (listed in the first table) with suitable Numeric / Factor variables; a count variable can be treated as numeric for this purpose.
We shall first consider Numeric variables and the plots suitable for understanding them.
We can use a scatter plot, histogram, box plot or density plot. A scatter plot of a single numeric variable needs values for both axes, so we usually take an index running from 1 to the number of rows of the data set as the x axis value; the y axis is the (numeric) variable of interest.
Considering population total
We create an index from 1 to the number of rows in the data. The type argument specifies the plot type, here scatter, and mode states whether we want just points, or lines over the points.
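As an alternative way of building the index, the following sketch stores it as a column of the data frame, so that the formula refers to the data rather than to a separate vector in the workspace. The column name index and the object names mw and p_index are illustrative assumptions, not part of the original notes.

```r
library(plotly)
library(dplyr)
data(midwest, package = "ggplot2")

# Add a row index (1, 2, ..., number of rows) as a column
mw <- midwest %>% mutate(index = row_number())

# Scatter plot of population total against the row index
p_index <- plot_ly(mw, x = ~index, y = ~poptotal,
                   type = "scatter", mode = "markers")
p_index
```

Keeping the index inside the data frame means it stays aligned with the rows if the data are later filtered or reordered before the index is created.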
Scatter plot
x=1:nrow(midwest)
plot_ly(midwest, x = ~x, y= ~poptotal,
type = "scatter",mode='markers')
The marker argument can be used to specify the color, size and symbol of the markers in a scatter plot; it is given as a list.
plot_ly(midwest, x = ~x, y= ~poptotal, type = "scatter",
mode='markers',
marker = list(color = "red",
size=15,symbol='10',
line = list(color = 'yellow',
width = 2)))
Or we can specify the aesthetics directly as below, using I() for color and symbol.
plot_ly(midwest,x=~x,y=~poptotal,
symbol=I('10'),
color=I("indianred4"),
size=10)
More options using layout() for formatting axis labels, title and ticks
Giving title and axes labels
plot_ly(midwest,x=~x,y=~poptotal,
symbol=I('10'),
color=I("indianred4"),
size=10) %>%
layout(title = 'Midwest analysis',xaxis = list(title= "Index"),yaxis = list(title = "Population total"))
Title Aesthetics
The font, tickfont and tickangle arguments are used to customize the aesthetics of the title and the axis ticks
plot_ly(midwest,x=~x,y=~poptotal,
symbol=I('10'),
color=I("indianred4"),
size=10) %>%layout(title = 'Midwest analysis',
font = list(
family = "Cursive", size=17, color = '#665D1E'),
xaxis = list(
title= "Index",
color = 'green',tickangle = -45,
tickfont = list(
family = "times", size = 14,color='orange')
),
yaxis = list(title = "Population total"))
Modifying background
The background can be changed using the arguments paper_bgcolor and plot_bgcolor
plot_ly(midwest,x=~x,y=~poptotal,
symbol=I('10'),
color=I("indianred4"),
size=10) %>%layout(title = 'Midwest analysis',
font = list(
family = "times", size=17, color = '#665D1E'),
xaxis = list(
title= "Index",
color = 'green',tickangle = -45,
tickfont = list(
family = "times", size = 14,color='orange')
),
yaxis = list(title = "Population total"),
paper_bgcolor = 'pink',
plot_bgcolor = 'moccasin'
)
Contextual Change
Depending on the context of the data set or the variable of interest, we may wish to display only part of the range of values, or to use a mathematical transformation (for example, the logarithm) of a variable. Restricting the displayed values is done with the range argument of an axis. Let us consider population totals below 10,000
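The logarithm mentioned above is a separate setting from range. One option, sketched here as an assumption since it is not used elsewhere in these notes, is the axis attribute type = "log" of plotly.js, which puts the axis on a logarithmic scale without changing the data. The object name p_log is illustrative.

```r
library(plotly)
data(midwest, package = "ggplot2")

# Index for the x axis, as in the scatter plots above
x <- seq_len(nrow(midwest))

# Same scatter plot, with the y axis drawn on a logarithmic scale
p_log <- plot_ly(midwest, x = ~x, y = ~poptotal,
                 type = "scatter", mode = "markers") %>%
  layout(xaxis = list(title = "Index"),
         yaxis = list(title = "Population total (log scale)", type = "log"))
p_log
```

The range-restricted plot of population totals below 10,000 follows.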
plot_ly(midwest,x=~x,y=~poptotal,
symbol=I('10'),
color=I("indianred4"),
size=10) %>%
layout(title = 'Midwest Poptotal',xaxis = list(title= "Index"),yaxis = list(title = "Population total",range=c(0,10000)))
Customizing difference between ticks
Sometimes we might wish to specify tick marks based on context; this can be done as below.
dtick is used to give the difference between consecutive ticks
plot_ly(midwest,x=~x,y=~poptotal,
symbol=I('10'),
color=I("indianred4"),
size=10) %>%
layout(xaxis = list(dtick = 20,
tickmode = "linear"))
Alphabetical ticks
ticktext is used to specify the letters or words we wish to show as tick labels, and tickvals is used to give their positions
plot_ly(midwest,x=~x,y=~area,
symbol=I('10'),
color=I("indianred4"),
size=10) %>%
layout(
xaxis = list(
ticktext = list(
"Hundred", "Two hundred", "Three hundred", "Four hundred"),
tickvals = c(100,200,300,400),
tickmode = "array"),
yaxis = list(
ticktext = list(
"a","b","c","d","e","f"),
tickvals = c(0.02,0.04,0.06,0.08,0.1,0.12),
tickmode = "array"))
Line chart
plot_ly(midwest, x = ~x, y = ~area,
type = 'scatter', mode = 'lines')
Lines with markers with specs
plot_ly(midwest, x = ~x, y = ~area,
type = 'scatter', mode = 'lines+markers',
marker = list(color = 'rgba(50, 70, 193, .9)',
size=8,symbol=I('5'),
line = list(color = 'rgba(152, 0, 0, .8)',
width = 2)))
Box plot
type is specified as box; the stroke argument is used to change the outline color
plot_ly(midwest, y = ~area, type = 'box',color=I("gold"),stroke=I('blue'))
Histogram
plot_ly(midwest, x = ~area, type = 'histogram',color=I("gold"),stroke=I('blue'))
Density Plot
We first estimate the density and then use the result to draw a line chart with area fill
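Before plotting, it may help to see what density() from base R returns. The sketch below uses a synthetic vector v rather than the midwest data, purely for illustration; the names v, d_demo and area are assumptions of this example.

```r
set.seed(1)
v <- rnorm(200, mean = 10, sd = 2)   # synthetic data, not from midwest

# Kernel density estimate; the result holds a grid of x values and the
# estimated density y at each grid point
d_demo <- density(v)
class(d_demo)
length(d_demo$x)   # 512 grid points by default
length(d_demo$y)

# Trapezoid-rule check that the estimated curve integrates to about 1
area <- sum(diff(d_demo$x) * (head(d_demo$y, -1) + tail(d_demo$y, -1)) / 2)
area
```

The x and y components are what get passed to plot_ly() as the coordinates of the filled line.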
# Finding density
d=density(midwest$percadultpoverty)
## X axis-->x component of d
## Y axis-->y component of d
plot_ly(x=d$x,y=d$y,
type='scatter',mode='lines',
fill='tozeroy',color =I('red'))
Box plot based on Factor
The color argument is used to specify the Factor variable
plot_ly(midwest, y = ~area, type = 'box',color=~state,stroke=I('goldenrod'))
Histogram based on Factor
plot_ly(midwest, x = ~area, type = 'histogram',color=~state,stroke=I('aliceblue'))
Line chart based on Factor
plot_ly(midwest, x = ~x, y = ~area,color=~inmetro,
type = 'scatter', mode = 'lines')
Density Plot based on Factor
For this we estimate the density separately for each category of the Factor variable and then add each density curve as a trace on top of the existing one
mid1 <- midwest %>% filter(inmetro=="0") %>% droplevels()
density1 <- density(mid1$poptotal)
mid2 <- midwest %>% filter(inmetro=="1") %>% droplevels()
density2 <- density(mid2$poptotal)
original_plot1 <- plot_ly(x = ~density1$x,
y = ~density1$y,
type = 'scatter',
mode = 'lines',
name = 'inmetro==0',
fill = 'tozeroy')
final_plot <- original_plot1 %>% add_trace(x = ~density2$x,
y = ~density2$y,
name = 'inmetro==1',
fill = 'tozeroy')
final_plot
A Metric variable categorized by factor variables (binary / polychotomous) and paired with one or more numeric variables.
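Before a factor is mapped to color, the plain numeric-against-numeric case (Plot 3 in the first table) may be sketched as below. The pairing of percollege with percbelowpoverty and the object name p_num are illustrative choices, not taken from the original notes.

```r
library(plotly)
data(midwest, package = "ggplot2")

# Scatter plot of one numeric variable against another
p_num <- plot_ly(midwest, x = ~percollege, y = ~percbelowpoverty,
                 type = "scatter", mode = "markers") %>%
  layout(xaxis = list(title = "Percent college educated"),
         yaxis = list(title = "Percent below poverty"))
p_num
```

Here both axes carry a variable of interest, so no index is needed.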
Color, size and/or shape can be used for this kind of plot. Instead of specifying a single color, we map color to the factor variable by which we wish to analyze the numeric variable. The colors argument can be used to specify the color palette.
plot_ly(midwest,
x = ~x,
y= ~percbelowpoverty,
color=~state,
type = "scatter",
colors='Set2', mode='markers'
) %>%
layout(title = 'Percbelowpoverty based on State',xaxis = list(title= "Index"),yaxis = list(title = "Percbelowpoverty"))
Using symbol we can add another variable, for example ‘inmetro’
plot_ly(midwest,
x = ~x,
y= ~percbelowpoverty,
color=~state,
type = "scatter",
symbol=~inmetro,
colors='Set2', mode='markers'
) %>%
layout(title = 'Percbelowpoverty based on State and inmetro',xaxis = list(title= "Index"),yaxis = list(title = "Percbelowpoverty"))
We can add a numeric variable using size
plot_ly(midwest,
x = ~x,
y= ~percbelowpoverty,
color=~state,
type = "scatter",
symbol=~inmetro,
colors='Set2', mode='markers',
size=~area
) %>%
layout(title = 'Percbelowpoverty based on State and inmetro',xaxis = list(title= "Index"),yaxis = list(title = "Percbelowpoverty"))
We shall consider Factor variables and the suitable plots for understanding them.
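Pie and bar charts of a factor variable display its counts or proportions, the numerical summaries listed for factors in the variable table. A minimal sketch of computing them with count() from dplyr follows; the object name state_summary and the column name proportion are illustrative.

```r
library(dplyr)
data(midwest, package = "ggplot2")

# Number of observations per state, and each state's share of the total
state_summary <- midwest %>%
  count(state) %>%
  mutate(proportion = n / sum(n))
state_summary
```

The column n produced by count() is the same quantity that the bar and donut examples pass to plot_ly() as values.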
Pie
plot_ly(midwest,labels=~state,type="pie")
Customizing text and its position
textposition specifies whether the text is placed outside or inside (the default) the wedges, and textinfo specifies whether the text shows the label, the percent, or both
plot_ly(midwest,labels=~state,type="pie",textposition = 'outside',
textinfo = 'label+percent')
Customizing color and text orientation and wedge width and color
The marker argument is used to customize the colors in the pie, its line element is used to customize the color and width of the wedge outlines, and insidetextorientation is used to specify the text orientation
plot_ly(midwest,labels=~state,type="pie",
textposition = 'inside',
textinfo = 'label+percent',
insidetextfont = list(color = 'darkred',size=20),
marker = list(colors =c("aliceblue","green","blue","red","gold") ,
line = list(color = 'pink', width = 5)),
insidetextorientation='radial')
Donut
The hole argument is used to specify the size of the hole inside the donut
plot_ly(marker=list(
colors=c("goldenrod","yellow","black","mistyrose","skyblue"))
)%>%
add_pie(data=count(midwest,state),
labels=~state,
values = ~n,
hole=.4
)%>%
layout(title="Donut of Midwest states",
font = list(family = "Cursive",
size=17, color = 'Orange'))
Rotating the plot
rotation is used to rotate the plot by the specified number of degrees
plot_ly(rotation=180,marker=list(
colors=c("goldenrod","yellow","black","mistyrose","skyblue"))
)%>%
add_pie(data=count(midwest,state),
labels=~state,
values = ~n,
hole=0.5
)%>%
layout(title="Midwest analysis",
font = list(family = "Cursive",
size=17,
color = 'Orange'))
Horizontal Bar Plot
midwest %>%
count(state) %>%
plot_ly(x = ~n,
y = ~state,
type="bar") %>%
layout(title="Midwest analysis",
font = list(family = "sans serif",
size=17,
color = 'khaki'),
xaxis = list(title = "Number of Observations",
tickfont = list(
family = "Times",
size = 14,
color = 'darkgreen')),
yaxis = list(title = 'States',
tickfont = list(
family = "Times",
size = 14,
color = 'red')),
paper_bgcolor = 'green',
plot_bgcolor = 'whitesmoke')
Bar Plot with text representing count
midwest %>%
count(inmetro) %>%
plot_ly(x=~inmetro, y=~n, color=I("darkblue"),type='bar',
text=~n,textposition='outside')
Grouped bar
The default bar plot is a grouped bar
midwest %>%
count(inmetro,state) %>%
plot_ly(x=~state,y=~n,color=~inmetro,type='bar')
Stacked bar
midwest %>%
count(inmetro,state) %>%
plot_ly(x=~state,y=~n,color=~inmetro,type='bar') %>%
layout(barmode='stack')
Legend position
The legend argument of layout() is used to specify the position of the legend through its x and y coordinates
midwest %>%
count(inmetro,state) %>%
plot_ly(x=~state,y=~n,color=~inmetro,type='bar') %>%
layout(barmode='stack',legend=list(x=0,y=-.5))
Hiding Legend
midwest %>%
count(inmetro,state) %>%
plot_ly(x=~state,y=~n,color=~inmetro,type='bar') %>%
layout(barmode='stack',showlegend=FALSE)
We use subplot to arrange more than one plot in a single display
p1 <- plot_ly(midwest, x = ~x, y= ~area,
type = "scatter",mode='markers')
p2 =midwest %>%
count(inmetro,state) %>%
plot_ly(x=~state,y=~n,color=~inmetro,type='bar')
p8 <- plot_ly(midwest, x = ~x, y = ~percbelowpoverty,
type = 'scatter',
mode = 'lines',
line = list(color = 'rgb(205, 12, 24)', width = 2), name = 'Percentage Below Poverty')%>%
add_trace(y = ~percadultpoverty,
line = list(color = 'rgb(22, 96, 167)',
width = 2, dash = 'dash'),
name = 'Percentage Adult Poverty')
ph=plot_ly(midwest,
x =~percbelowpoverty,
type = 'histogram',
color=~inmetro,
colors=c(I("red"),I('blue'))
)
## default
subplot(p1,p2,p8,ph)
Specifying number of rows
subplot(p1,p2,p8,ph,nrows=2)
Specifying axis titles
subplot(p1,p2,p8,ph,nrows=2,titleX = TRUE)
We use group_by and do to create a plot for each category of a factor variable, and then use subplot to obtain the desired output
midwest%>%
group_by(state) %>%
do(p=plot_ly(., x = ~area, y = ~popdensity, color = ~state,colors="Set2",
type = "scatter",mode='markers')) %>%
subplot(nrows = 1, shareX = TRUE, shareY = TRUE)
We can also create a plot using ggplot2 and related packages and convert it to a plotly object with ggplotly(). For example, we can obtain a pairs plot (scatter plot matrix) using ggpairs() from GGally and convert it to a plotly diagram
g1 <- ggpairs(midwest,columns=c(4:8), # included only few variables for better illustration
lower = list(continuous = wrap("points",
color ="darkgreen",alpha = 0.25,size=2,shape=5)),
diag = list(continuous=wrap("densityDiag",color = "red",
fill="yellow")))
ggplotly(g1)
These notes give an outline of important plots that can be used for visualizing numeric as well as factor variables in order to understand the data.
We should now be ready to generate plots using plotly and the other necessary packages.
That is, we should be able to
Know the different plots based on the nature of the variables
Create basic plots
Use appropriate enhancements in the appearance of a plot and its components such as axes, text, labels and titles
Plan multiple plots in the required layouts with the available options
Increase the dimensions of a plot by adding as many variables as required for comparison, while retaining the readability of the plot for better insight