Preamble

These notes provide the essential details for learning how to produce visual summaries of a data set using the plotly package in R. The approach starts from the nature of a variable and the list of plots suited to that kind of variable. We shall also note ways to improve the presentation of a plot by enhancing its components.

The code can be displayed using the tab at the top right of each output.

Ready Reckoner of suitable plots based on nature of variable

library(kableExtra)
library(DT)
p1=c("Plot 1","Plot 2","Plot 3","Plot 4")
p2=c("1-D","2-D More","2-D More","2-D More")
p3=c("Numeric","Numeric-Grouped by Factor (Dichot/Polychot)","Numeric and Numeric","Binary (Dichot/Polychot)")
p4=c("Histogram, Box plot, Density, Scatter Plot","Histogram, Box plot, Density, Scatter Plot","Scatter Plot","Bar plot (1-D, 2-D - Factor and Factor - Stack, Side), Pie Chart, Donut")
plda=as.data.frame(cbind(p1,p2,p3,p4))
colnames(plda)=c("Plots","Dimension","Variable Type","Plot Type")

kable(rbind(plda)) %>%
  kable_styling(bootstrap_options ='striped')
Plots | Dimension | Variable Type | Plot Type
Plot 1 | 1-D | Numeric | Histogram, Box plot, Density, Scatter Plot
Plot 2 | 2-D More | Numeric-Grouped by Factor (Dichot/Polychot) | Histogram, Box plot, Density, Scatter Plot
Plot 3 | 2-D More | Numeric and Numeric | Scatter Plot
Plot 4 | 2-D More | Binary (Dichot/Polychot) | Bar plot (1-D, 2-D - Factor and Factor - Stack, Side), Pie Chart, Donut

Data Set

We shall make use of the midwest data set from the ggplot2 package, and load the plotly package for visualization.

library(plotly) 
library(dplyr)
library(GGally)
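data(midwest,package='ggplot2')
# inmetro is stored as an integer (0/1); convert it to a factor so that
# plotly treats it as a categorical variable in color/symbol mappings
midwest$inmetro=factor(midwest$inmetro)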

Let us look at the nature of the variables in the midwest data:

csl=1:length(midwest)
cn=names(midwest)

ct=c(rep("Character",2),
     "Factor",
     rep("Numeric",23),
     rep("Factor",2))
cv=c(rep("-",2),
     "Bar/Pie",
     rep("Scatter/Histogram/Box/Density",23),
     rep("Bar/Pie",2))     
cnu=c(rep("-",2),
      "Count/Proportion",
      rep("Mean/Median/Percentiles/Variance",23),
      rep("Count/Proportion",2))
meta=data.frame(
  sl.no=csl,Variable=cn,Type=ct,
  Visual_summary=cv,Numerical_summary=cnu
)
kbl(meta) %>% kable_material_dark()
sl.no | Variable | Type | Visual_summary | Numerical_summary
1 | PID | Character | - | -
2 | county | Character | - | -
3 | state | Factor | Bar/Pie | Count/Proportion
4 | area | Numeric | Scatter/Histogram/Box/Density | Mean/Median/Percentiles/Variance
5 | poptotal | Numeric | Scatter/Histogram/Box/Density | Mean/Median/Percentiles/Variance
6 | popdensity | Numeric | Scatter/Histogram/Box/Density | Mean/Median/Percentiles/Variance
7 | popwhite | Numeric | Scatter/Histogram/Box/Density | Mean/Median/Percentiles/Variance
8 | popblack | Numeric | Scatter/Histogram/Box/Density | Mean/Median/Percentiles/Variance
9 | popamerindian | Numeric | Scatter/Histogram/Box/Density | Mean/Median/Percentiles/Variance
10 | popasian | Numeric | Scatter/Histogram/Box/Density | Mean/Median/Percentiles/Variance
11 | popother | Numeric | Scatter/Histogram/Box/Density | Mean/Median/Percentiles/Variance
12 | percwhite | Numeric | Scatter/Histogram/Box/Density | Mean/Median/Percentiles/Variance
13 | percblack | Numeric | Scatter/Histogram/Box/Density | Mean/Median/Percentiles/Variance
14 | percamerindan | Numeric | Scatter/Histogram/Box/Density | Mean/Median/Percentiles/Variance
15 | percasian | Numeric | Scatter/Histogram/Box/Density | Mean/Median/Percentiles/Variance
16 | percother | Numeric | Scatter/Histogram/Box/Density | Mean/Median/Percentiles/Variance
17 | popadults | Numeric | Scatter/Histogram/Box/Density | Mean/Median/Percentiles/Variance
18 | perchsd | Numeric | Scatter/Histogram/Box/Density | Mean/Median/Percentiles/Variance
19 | percollege | Numeric | Scatter/Histogram/Box/Density | Mean/Median/Percentiles/Variance
20 | percprof | Numeric | Scatter/Histogram/Box/Density | Mean/Median/Percentiles/Variance
21 | poppovertyknown | Numeric | Scatter/Histogram/Box/Density | Mean/Median/Percentiles/Variance
22 | percpovertyknown | Numeric | Scatter/Histogram/Box/Density | Mean/Median/Percentiles/Variance
23 | percbelowpoverty | Numeric | Scatter/Histogram/Box/Density | Mean/Median/Percentiles/Variance
24 | percchildbelowpovert | Numeric | Scatter/Histogram/Box/Density | Mean/Median/Percentiles/Variance
25 | percadultpoverty | Numeric | Scatter/Histogram/Box/Density | Mean/Median/Percentiles/Variance
26 | percelderlypoverty | Numeric | Scatter/Histogram/Box/Density | Mean/Median/Percentiles/Variance
27 | inmetro | Factor | Bar/Pie | Count/Proportion
28 | category | Factor | Bar/Pie | Count/Proportion

Plotly: an Introduction

plotly is an R package for creating interactive web-based graphs via the open source JavaScript graphing library plotly.js, which powers every graph made with the package. The plot_ly() function provides a 'direct' interface to plotly.js with some additional abstractions to help reduce typing. These abstractions, inspired by the Grammar of Graphics and ggplot2, make it much faster to iterate from one graphic to another, making it easier to discover interesting features in the data.

Workflow

The plotly package takes a purely functional approach to a layered grammar of graphics. plot_ly() code is first converted to an R list, the list is then converted to a JSON object, and only then does plotly.js render the chart from that JSON. The workflow is shown in the image below.
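The conversion steps described above can be inspected from R. The following is a minimal sketch on toy data (the values are made up for illustration); it assumes that plotly_build() and plotly_json() behave as in current releases of the plotly package.

```r
library(plotly)

# A small illustrative plot (toy values, not taken from the midwest data)
p <- plot_ly(x = 1:3, y = c(2, 4, 8), type = "scatter", mode = "markers")

# Step 1: plotly_build() evaluates the plot into a plain R list
built <- plotly_build(p)
names(built$x)            # contains, among others, "data" and "layout"
built$x$data[[1]]$type    # trace type stored in the list

# Step 2: the same list serialized to the JSON that plotly.js receives
json <- plotly_json(p, jsonedit = FALSE)
substr(json, 1, 200)      # first characters of the JSON string
```

Looking at the built list is often a convenient way to check which attributes were actually sent to plotly.js.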

Components of plotly.js

In plotly.js terminology, a figure has two key components: data (aka traces) and a layout. A trace defines a mapping from data to visuals. Every trace has a type (e.g., histogram, pie, scatter, etc.), and the trace type determines what other attributes (i.e., visual and/or interactive properties, like x, hoverinfo, name) are available to control the trace mapping.
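As a small sketch of the two components, the figure below is built with one explicit trace and one layout call, and the built object is then inspected. The object names fig and built are only illustrative.

```r
library(plotly)
data(midwest, package = "ggplot2")

fig <- plot_ly(midwest) %>%
  # trace: maps the data (area) to a visual (histogram bars)
  add_trace(x = ~area, type = "histogram", name = "area") %>%
  # layout: settings that are not tied to a particular trace
  layout(title = "Area of midwest counties")

built <- plotly_build(fig)
length(built$x$data)       # number of traces in the figure
built$x$data[[1]]$type     # type of the first trace
names(built$x$layout)      # layout settings, including the title
```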

Visualization

We shall illustrate each type of plot (listed in the first table) with suitable Numeric / Factor variables; count variable can be treated as numeric for this purpose.

Numeric variables

We shall consider Numeric variables and the suitable plots for understanding them.

Single Numeric variable

We can try a scatter plot, histogram, box plot or density plot. For a scatter plot with a single numeric variable we have to define values for both axes. So we usually take an index from 1 to the number of rows of the data set as the x axis value; the y axis will be our (numeric) variable of interest.

Considering population total
We create an index from 1 to the number of rows in the data. The type argument specifies the plot type, which is scatter, and mode specifies whether we want just points (markers), lines, or lines over the points.

Scatter plot

x=1:nrow(midwest)
plot_ly(midwest, x = ~x, y= ~poptotal,
              type = "scatter",mode='markers')
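An alternative sketch keeps the index inside the data frame instead of in a separate global vector, so that the plot depends only on the data passed to plot_ly(). The names midwest_idx and index are assumptions introduced for this illustration.

```r
library(plotly)
library(dplyr)
data(midwest, package = "ggplot2")

# Store the row index as a column of the data (hypothetical column name)
midwest_idx <- midwest %>% mutate(index = row_number())

plot_ly(midwest_idx, x = ~index, y = ~poptotal,
        type = "scatter", mode = "markers")
```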

Enhancements

The marker argument, given as a list, can be used to specify the color, size and symbol of the markers in a scatter plot.

plot_ly(midwest, x = ~x, y= ~poptotal, type = "scatter",
              mode='markers',
              marker = list(color = "red", 
                            size=15,symbol='10',
                            line = list(color = 'yellow',
                                        width = 2)))

Or we can specify the aesthetics directly as below using I() for mentioning color and symbol.

plot_ly(midwest,x=~x,y=~poptotal,
        symbol=I('10'),
        color=I("indianred4"),
        size=10)

More options using layout() for formatting axis labels, title and ticks

Giving title and axes labels

plot_ly(midwest,x=~x,y=~poptotal,
        symbol=I('10'),
        color=I("indianred4"),
        size=10) %>% 
  layout(title = 'Midwest analysis',xaxis = list(title= "Index"),yaxis = list(title = "Population total"))

Title Aesthetics

The font, tickfont and tickangle arguments are used to customize the title and tick label aesthetics.

plot_ly(midwest,x=~x,y=~poptotal,
        symbol=I('10'),
        color=I("indianred4"),
        size=10) %>%
  layout(title = 'Midwest analysis',
         font = list(family = "Cursive", size=17, color = '#665D1E'),
         xaxis = list(title= "Index",
                      color = 'green',tickangle = -45,
                      tickfont = list(family = "times", size = 14,color='orange')),
         yaxis = list(title = "Population total"))

Modifying background
Background can be changed using arguments paper_bgcolor and plot_bgcolor

plot_ly(midwest,x=~x,y=~poptotal,
        symbol=I('10'),
        color=I("indianred4"),
        size=10) %>%
  layout(title = 'Midwest analysis',
         font = list(family = "times", size=17, color = '#665D1E'),
         xaxis = list(title= "Index",
                      color = 'green',tickangle = -45,
                      tickfont = list(family = "times", size = 14,color='orange')),
         yaxis = list(title = "Population total"),
         paper_bgcolor = 'pink',
         plot_bgcolor = 'moccasin')

Contextual Change
Depending on the context of the data set / variable of interest, we may be interested in only a partial set of values, or in a mathematical transformation (for example, logarithm) of a variable. A partial set of values can be shown using the range argument of an axis. Let us consider population totals below 10,000.

plot_ly(midwest,x=~x,y=~poptotal,
        symbol=I('10'),
        color=I("indianred4"),
        size=10) %>% 
  layout(title = 'Midwest Poptotal',xaxis = list(title= "Index"),yaxis = list(title = "Population total",range=c(0,10000)))
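The logarithmic transformation mentioned above is not obtained through range. One possible way, sketched here, is to set the axis attribute type to "log" in layout(), which leaves the data untouched and changes only the axis scale. The names idx and p_log are illustrative.

```r
library(plotly)
data(midwest, package = "ggplot2")

idx <- seq_len(nrow(midwest))   # index for the x axis

p_log <- plot_ly(midwest, x = idx, y = ~poptotal,
                 type = "scatter", mode = "markers") %>%
  layout(title = "Midwest Poptotal",
         xaxis = list(title = "Index"),
         # log-scaled y axis; the plotted values themselves are unchanged
         yaxis = list(title = "Population total (log scale)", type = "log"))
p_log
```

Transforming the variable itself, for example y = ~log10(poptotal), is another option; in that case the tick labels show the transformed values.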

Customizing difference between ticks
Sometimes we might wish to specify tick marks based on context; this can be done as below.
dtick is used to give the difference between ticks.

plot_ly(midwest,x=~x,y=~poptotal,
        symbol=I('10'),
        color=I("indianred4"),
        size=10) %>% 
  layout(xaxis = list(dtick = 20, 
                           tickmode = "linear"))

Alphabetical ticks

ticktext is used to specify the letters/words we wish to display, and tickvals is used to specify their positions.

plot_ly(midwest,x=~x,y=~area,
        symbol=I('10'),
        color=I("indianred4"),
        size=10) %>% 
  layout(
    xaxis = list(
      ticktext = list(
        "Hundred", "Two hundred", "Three hundred", "Four hundred"),
          tickvals = c(100,200,300,400),
             tickmode = "array"),
    yaxis = list(
      ticktext = list(
        "a","b","c","d","e","f"),
      tickvals = c(0.02,0.04,0.06,0.08,0.1,0.12),
      tickmode = "array"))

More Single numeric plots

Line chart

plot_ly(midwest, x = ~x, y = ~area,
              type = 'scatter', mode = 'lines')

Lines with markers with specs

plot_ly(midwest, x = ~x, y = ~area, 
              type = 'scatter', mode = 'lines+markers',
              marker = list(color = 'rgba(50, 70, 193, .9)', 
                            size=8,symbol=I('5'),
                            line = list(color = 'rgba(152, 0, 0, .8)',
                                        width = 2)))

Box plot
type is specified as box; the stroke argument is used to change the outline color.

plot_ly(midwest, y = ~area, type = 'box',color=I("gold"),stroke=I('blue'))

Histogram

plot_ly(midwest, x = ~area, type = 'histogram',color=I("gold"),stroke=I('blue'))

Density Plot
We find density first and then use the results to form line chart with area fill

# Finding density
d=density(midwest$percadultpoverty)
## X axis-->x component of d
## Y axis-->y component of d

plot_ly(x=d$x,y=d$y,
              type='scatter',mode='lines',
              fill='tozeroy',color =I('red'))

Single numeric variable based on Factor

Box plot based on Factor
color argument is used to specify Factor variable

plot_ly(midwest, y = ~area, type = 'box',color=~state,stroke=I('goldenrod'))

Histogram based on Factor

plot_ly(midwest, x = ~area, type = 'histogram',color=~state,stroke=I('aliceblue'))

Line chart based on Factor

plot_ly(midwest, x = ~x, y = ~area,color=~inmetro,
              type = 'scatter', mode = 'lines')

Density Plot based on Factor
For this we can find the densities separately for each category of the Factor variable and then add each density plot as a trace on top of the existing one.

mid1 <- midwest %>% filter(inmetro=="0") %>% droplevels()
density1 <- density(mid1$poptotal)

mid2 <- midwest %>% filter(inmetro=="1") %>% droplevels()
density2 <- density(mid2$poptotal)

original_plot1 <- plot_ly(x = ~density1$x, 
               y = ~density1$y, 
               type = 'scatter',
               mode = 'lines',
               name = 'inmetro==0', 
               fill = 'tozeroy')
final_plot <- original_plot1 %>% add_trace(x = ~density2$x,
                         y = ~density2$y,
                         name = 'inmetro==1',
                         fill = 'tozeroy')
final_plot
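For a factor with more than two categories, repeating the filter step by hand becomes tedious. A possible generalization, sketched below under the assumption that one density per level of state is wanted, splits the variable by the factor and adds one trace per level in a loop. The names dens_by_state and p_dens are illustrative.

```r
library(plotly)
data(midwest, package = "ggplot2")

# One density estimate per level of the factor
dens_by_state <- lapply(split(midwest$percbelowpoverty, midwest$state), density)

# Start from an empty plot and add one filled line per level
p_dens <- plot_ly()
for (s in names(dens_by_state)) {
  d <- dens_by_state[[s]]
  p_dens <- add_lines(p_dens, x = d$x, y = d$y, name = s, fill = "tozeroy")
}
p_dens
```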

More than one Numeric variable (Multivariate)

A Metric variable categorized by factor variables (binary / polychotomous) and paired with one or more numeric variables.

Color, size, and / or shape can be used for this kind of plot. Instead of specifying a single color, we give the factor variable by which we wish to analyze the numeric variable. The colors argument can be used to specify the color palette.

plot_ly(midwest, 
               x = ~x,
               y= ~percbelowpoverty,
               color=~state,
               type = "scatter",
              colors='Set2', mode='markers'
               ) %>% 
  layout(title = 'Percbelowpoverty based on State',xaxis = list(title= "Index"),yaxis = list(title = "Percbelowpoverty"))

Using symbol we can add another variable for example ‘inmetro’

plot_ly(midwest, 
               x = ~x,
               y= ~percbelowpoverty,
               color=~state,
               type = "scatter",
               symbol=~inmetro,
              colors='Set2', mode='markers'
               ) %>% 
  layout(title = 'Percbelowpoverty based on State and inmetro',xaxis = list(title= "Index"),yaxis = list(title = "Percbelowpoverty"))

We can add a further numeric variable using size.

plot_ly(midwest, 
               x = ~x,
               y= ~percbelowpoverty,
               color=~state,
               type = "scatter",
               symbol=~inmetro,
              colors='Set2', mode='markers',
              size=~area
               ) %>% 
  layout(title = 'Percbelowpoverty based on State and inmetro',xaxis = list(title= "Index"),yaxis = list(title = "Percbelowpoverty"))

Factor variables

We shall consider Factor variables and the suitable plots for understanding them.

Single Factor variable

Pie

plot_ly(midwest,labels=~state,type="pie")

Customizing text and its position
textposition specifies whether the text is placed outside or inside the wedges (inside is the default), and textinfo specifies whether we want just the label, the percent, or both.

plot_ly(midwest,labels=~state,type="pie",textposition = 'outside',
               textinfo = 'label+percent')

Customizing color and text orientation and wedge width and color

The marker argument is used to customize the colors of the pie, its line element is used to customize the color and width of the wedge outlines, and insidetextorientation is used to specify the text orientation.

plot_ly(midwest,labels=~state,type="pie",
        textposition = 'inside',
        textinfo = 'label+percent',
        insidetextfont = list(color = 'darkred',size=20),
        marker = list(colors =c("aliceblue","green","blue","red","gold") ,
                      line = list(color = 'pink', width = 5)),
        
        insidetextorientation='radial')  

Donut
The hole argument is used to specify the size of the hole inside the donut.

plot_ly(marker=list(
  colors=c("goldenrod","yellow","black","mistyrose","skyblue"))
  )%>%
  add_pie(data=count(midwest,state),
          labels=~state,
          values = ~n,
          hole=.4
          )%>%
  layout(title="Donut of Midwest states",
         font = list(family = "Cursive", 
                     size=17, color = 'Orange'))

Rotating the plot
rotation is used to rotate the plot by the specified number of degrees.

plot_ly(rotation=180,marker=list(
  colors=c("goldenrod","yellow","black","mistyrose","skyblue"))
)%>%
  add_pie(data=count(midwest,state),
          labels=~state,
          values = ~n,
          hole=0.5
          )%>%
  layout(title="Midwest analysis",
         font = list(family = "Cursive", 
                     size=17,
                     color = 'Orange'))

Horizontal Bar Plot

midwest %>%
  count(state) %>%
  
  plot_ly(x = ~n,
          y = ~state,
          type="bar") %>%
  layout(title="Midwest analysis",
         font = list(family = "sans serif",
                     size=17,
                     
                     color = 'khaki'),
         xaxis = list(title = "Number of Observations",
                      tickfont = list(
                        family = "Times",
                        size = 14,
                        color = 'darkgreen')),
         yaxis = list(title = 'States',
         tickfont = list(
           family = "Times", 
           size = 14,
           color = 'red')),
         paper_bgcolor = 'green',
         plot_bgcolor = 'whitesmoke')

Bar Plot with text representing count

midwest %>% 
  count(inmetro) %>%
  plot_ly(x=~inmetro, y=~n, color=I("darkblue"),type='bar',
          text=~n,textposition='outside')

More than one Factor variable

The default bar plot for two factor variables is a grouped bar.

Grouped bar

midwest %>%
  count(inmetro,state) %>% 
  plot_ly(x=~state,y=~n,color=~inmetro,type='bar')

Stacked bar

midwest %>%
  count(inmetro,state) %>% 
  plot_ly(x=~state,y=~n,color=~inmetro,type='bar') %>%
  layout(barmode='stack')

Legend position
The legend argument of layout() is used to specify the x and y position of the legend.

midwest %>%
  count(inmetro,state) %>% 
  plot_ly(x=~state,y=~n,color=~inmetro,type='bar') %>%
  layout(barmode='stack',legend=list(x=0,y=-.5))

Hiding Legend

midwest %>%
  count(inmetro,state) %>% 
  plot_ly(x=~state,y=~n,color=~inmetro,type='bar') %>%
  layout(barmode='stack',showlegend=F)

Subplots

We use subplot to arrange more than one plot so that they are presented together.

p1 <- plot_ly(midwest, x = ~x, y= ~area,
              type = "scatter",mode='markers')

p2 =midwest %>%
  count(inmetro,state) %>% 
  plot_ly(x=~state,y=~n,color=~inmetro,type='bar')

p8 <- plot_ly(midwest, x = ~x, y = ~percbelowpoverty, 
              type = 'scatter', 
              mode = 'lines',
              line = list(color = 'rgb(205, 12, 24)', width = 2), name = 'Percentage Below Poverty')%>%
  add_trace(y = ~percadultpoverty, 
            line = list(color = 'rgb(22, 96, 167)', 
                        width = 2, dash = 'dash'), 
            name = 'Percentage Adult Poverty') 

ph=plot_ly(midwest,
           x =~percbelowpoverty, 
           type = 'histogram',
           color=~inmetro,
           colors=c(I("red"),I('blue'))
)
## default
subplot(p1,p2,p8,ph)

Specifying number of rows

subplot(p1,p2,p8,ph,nrows=2)

Specifying axis titles

subplot(p1,p2,p8,ph,nrows=2,titleX = T)

Multiple Plots using Facet

We use group_by and do to create a plot for each category of the factor variable, and then use subplot to obtain the desired output.

midwest%>%
  group_by(state) %>%
  do(p=plot_ly(., x = ~area, y = ~popdensity, color = ~state,colors="Set2",
               type = "scatter",mode='markers')) %>%
  subplot(nrows = 1, shareX = TRUE, shareY = TRUE)
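The same facet-like layout can also be sketched without do(), by splitting the data by the factor and building one plot per piece with lapply(); subplot() accepts the resulting list. This is only an alternative sketch, and the names panels and sp are illustrative.

```r
library(plotly)
data(midwest, package = "ggplot2")

# One scatter plot per state, collected in a named list
panels <- lapply(split(midwest, midwest$state), function(d) {
  plot_ly(d, x = ~area, y = ~popdensity,
          type = "scatter", mode = "markers",
          name = d$state[1])   # legend entry named after the state
})

# Arrange the panels side by side with shared axes
sp <- subplot(panels, nrows = 1, shareX = TRUE, shareY = TRUE)
sp
```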

Using ggplotly

We can create and save a plot using ggplot2 and related packages and then convert it to a plotly object. For example, we can obtain a pairs plot (a scatter plot matrix with correlations) using GGally and convert it to a plotly diagram.

g1 <- ggpairs(midwest,columns=c(4:8), # included only a few variables for better illustration
              lower = list(continuous = wrap("points",
                                             color ="darkgreen",alpha = 0.25,size=2,shape=5)),
              diag = list(continuous=wrap("densityDiag",color = "red",
                                          fill="yellow")))
ggplotly(g1)
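The conversion works in the same way for an ordinary ggplot2 object. A minimal sketch follows; the choice of variables and the names g2 and gp are only for illustration.

```r
library(ggplot2)
library(plotly)
data(midwest, package = "ggplot2")

# An ordinary ggplot2 scatter plot, saved as an object
g2 <- ggplot(midwest, aes(x = area, y = popdensity, colour = state)) +
  geom_point(alpha = 0.5)

# Convert the saved ggplot object to an interactive plotly object
gp <- ggplotly(g2)
gp
```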

Final Note

These notes give an outline of the important plots which can be used for visualizing numeric as well as factor variables in order to understand the data.

Now we may be ready to generate plots using plotly and the other necessary packages.
That is, we may be able to:

  • Know the different plots based on the nature of the variables

  • Create basic plots

  • Use appropriate enhancements in the appearance of the plot and its components like axes, texts, labels, titles etc.

  • Plan multiple plots in the required layouts with the available options

  • Increase the dimensions of a plot by adding as many variables as required for comparison, while retaining the readability of the plot for better insight