A. R ENVIRONMENT

1. Working Directory

setwd("E:/Semester 2/Data Mining Visualisation")

2. Previous Commands

#history()
#loadhistory(file="Session1.Rhistory")

3. Save your command history

savehistory(file = "myfile") # default is ".Rhistory"

B. Plotting Systems

1. The Base Plotting System

The base plotting system is the original plotting system for R. The basic model is sometimes referred to as the “artist’s palette” model. The idea is that you start with a blank canvas and build up from there.

In more R-specific terms, you typically start with the plot function (or a similar plot-creating function) to initiate a plot and then annotate the plot with various annotation functions (text, lines, points, axis).

The base plotting system is often the most convenient plotting system to use because it mirrors how we sometimes think of building plots and analyzing data. If we don’t have a completely well-formed idea of how we want to look at some data, often we’ll start by “throwing some data on the page” and then slowly add more information to it as our thought process evolves.

The core plotting and graphics engine in R is encapsulated in the following packages:

graphics: contains plotting functions for the “base” graphing systems, including plot, hist, boxplot and many others.
grDevices: contains all the code implementing the various graphics devices, including X11, PDF, PostScript, PNG, etc.

The grDevices package contains the functionality for sending plots to various output devices. The graphics package contains the code for actually constructing and annotating plots.
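The division of labour between the two packages can be sketched as follows: a grDevices function opens and closes the output device, and a graphics function draws on it. This is a minimal sketch; the file name cars_scatter.pdf and the use of a temporary directory are assumptions made for illustration only.

```r
# Hypothetical output file in a temporary directory (illustration only)
out_file <- file.path(tempdir(), "cars_scatter.pdf")

pdf(out_file, width = 7, height = 5)   # grDevices: open a PDF file device
with(cars, plot(speed, dist))          # graphics: draw the plot on that device
invisible(dev.off())                   # grDevices: close the device so the file is written

file.exists(out_file)
```

Until dev.off() is called the file is incomplete, so closing the device is part of the pattern rather than an optional clean-up step.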

## Create the plot / draw canvas
with(cars, plot(speed, dist))

The downside of the base plotting system is that it’s difficult to describe or translate a plot to others because there’s no clear graphical language or grammar that can be used to communicate what you’ve done. The only real way to describe what you’ve done in a base plot is to just list the series of commands/functions that you’ve executed, which is not a particularly compact way of communicating things. This is one problem that the ggplot2 package attempts to address.

2. Base Graphics

Base graphics are used most commonly and are a very powerful system for creating data graphics. There are two phases to creating a base plot:

Initializing a new plot
Annotating (adding to) an existing plot

Calling plot(x, y) or hist(x) will launch a graphics device (if one is not already open) and draw a new plot on the device. If the arguments to plot are not of some special class, then the default method for plot is called; this function has many arguments, letting you set the title, x axis label, y axis label, etc.

The base graphics system has many global parameters that can be set and tweaked. These parameters are documented in ?par and are used to control the global behavior of plots, such as the margins, axis orientation, and other details. It wouldn’t hurt to try to memorize at least part of this help page!
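As a small illustration of working with these global parameters, the sketch below queries two of them, changes them for one plot, and then restores the previous values. The particular margin values are arbitrary choices made for the example.

```r
# Query current settings
par("mar")   # margin sizes in lines: bottom, left, top, right
par("las")   # orientation of axis labels

# par() returns the previous values of the parameters it changes
old_par <- par(mar = c(4, 4, 2, 1), las = 1)
with(cars, plot(speed, dist))

# Restore the saved settings so that later plots are unaffected
par(old_par)
```

Saving the return value of par() and passing it back afterwards is a common way to keep a temporary change from leaking into later plots.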

Another typical base plot is constructed with the following code.

data(cars)

## Create the plot / draw canvas
with(cars, plot(speed, dist))

## Add annotation
title("Speed vs. Stopping distance")

Base plot with title

3. Simple Base Graphics

Histogram

Here is an example of a simple histogram made using the hist() function in the graphics package. If you run this code and your graphics window is not already open, it should open once you call the hist() function.

library(datasets)

## Draw a new plot on the screen device
hist(airquality$Ozone)

Ozone levels in New York City

Boxplot

Boxplots can be made in R using the boxplot() function, which takes as its first argument a formula. The formula has the form y-axis ~ x-axis. Anytime you see a ~ in R, it’s a formula. Here, we are plotting ozone levels in New York by month, and the right-hand side of the ~ indicates the month variable. We first transform the month variable into a factor so that it is treated as a categorical variable. boxplot() groups by the distinct values of the right-hand side either way, but a factor makes the intent explicit and keeps other functions, such as plot() with the same formula, from treating the month variable as continuous.

airquality <- transform(airquality, Month = factor(Month))
boxplot(Ozone ~ Month, airquality, xlab = "Month", ylab = "Ozone (ppb)")

Ozone levels by month in New York City

Each boxplot shows the median and the 25th and 75th percentiles of the data (the “box”). The “whiskers” extend to the most extreme data points that lie within 1.5 times the interquartile range (IQR) of the box. Any data points beyond that range are drawn separately as circles.
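The numbers behind a single box can be inspected with boxplot.stats() from grDevices. The sketch below does this for the May ozone values; the variable names are chosen for the example.

```r
# Ozone readings for May (NA values are allowed and omitted by boxplot.stats)
ozone_may <- with(airquality, Ozone[Month == 5])

s <- boxplot.stats(ozone_may)
s$stats   # lower whisker, lower hinge, median, upper hinge, upper whisker
s$out     # points that would be drawn individually as circles
```

The hinges are close to, but not always exactly equal to, the 25th and 75th percentiles returned by quantile().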

In this case the monthly boxplots show some interesting features. First, the levels of ozone tend to be highest in July and August. Second, the variability of ozone is also highest in July and August. This phenomenon is common with environmental data where the mean and the variance are often related to each other.
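The visual impression from the boxplots can be checked numerically. The following sketch computes the mean and standard deviation of ozone for each month; it is only one of several ways to produce such a summary.

```r
# Mean and standard deviation of ozone by month, ignoring missing values
ozone_mean <- with(airquality, tapply(Ozone, Month, mean, na.rm = TRUE))
ozone_sd   <- with(airquality, tapply(Ozone, Month, sd,   na.rm = TRUE))

round(rbind(mean = ozone_mean, sd = ozone_sd), 1)
```

Months with a larger mean also show a larger standard deviation, which is the mean-variance relationship described above.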

Scatterplot

Here is a simple scatterplot made with the plot() function.

with(airquality, plot(Wind, Ozone))

Scatterplot of wind and ozone in New York City

Generally, the plot() function takes two vectors of numbers: one for the x-axis coordinates and one for the y-axis coordinates. However, plot() is what’s called a generic function in R, which means its behavior can change depending on what kinds of data are passed to the function.
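To see what “generic” means in practice, the sketch below passes three different kinds of input to plot(). The variable name month_f is chosen for the example.

```r
month_f <- factor(airquality$Month)

old_par <- par(mfrow = c(1, 3))
plot(airquality$Wind, airquality$Ozone)  # two numeric vectors: scatterplot
plot(month_f)                            # a factor on its own: bar chart of counts
plot(month_f, airquality$Ozone)          # a factor and a numeric vector: one boxplot per level
par(old_par)

# A few of the methods that plot() can dispatch to
head(methods(plot))
```

The same function name produces a scatterplot, a bar chart, or boxplots because R selects a method based on the class of the first argument.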

One thing to note here is that although we did not provide labels for the x- and the y-axis, labels were automatically created from the names of the variables (i.e. “Wind” and “Ozone”). This can be useful when you are making plots quickly, but it demands that you have useful descriptive names for your variables and R objects.

4. Some Important Base Graphics Parameters

Many base plotting functions share a set of global parameters. Here are a few key ones:

pch: the plotting symbol (default is open circle)
lty: the line type (default is solid line), can be dashed, dotted, etc.
lwd: the line width, specified as a positive number that acts as a multiple of the default width (default is 1)
col: the plotting color, specified as a number, string, or hex code; the colors() function gives you a vector of colors by name
xlab: character string for the x-axis label
ylab: character string for the y-axis label
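A short sketch that uses all of the parameters listed above in one plot. The reference line at 50 ppb is an arbitrary value chosen only to demonstrate lty and lwd.

```r
# Point symbol, color and axis labels are set in the call to plot()
with(airquality,
     plot(Wind, Ozone,
          pch = 17,                 # filled triangle
          col = "steelblue",        # color given by name
          xlab = "Wind (mph)",
          ylab = "Ozone (ppb)"))

# Line type, width and a hex color are set on an added reference line
abline(h = 50, lty = 2, lwd = 2, col = "#CC0000")
```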

Base Plotting Functions

The most basic base plotting function is plot(). The plot() function makes a scatterplot, or other type of plot depending on the class of the object being plotted. Calling plot() will draw a plot on the screen device (and open the screen device if not already open). After that, annotation functions can be called to add to the already-made plot.

Some key annotation functions are

lines: add lines to a plot, given a vector of x values and a corresponding vector of y values (or a 2-column matrix); this function just connects the dots
points: add points to a plot
text: add text labels to a plot using specified x, y coordinates
title: add annotations to x, y axis labels, title, subtitle, outer margin
mtext: add arbitrary text to the margins (inner or outer) of the plot
axis: add axis ticks/labels
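The annotation functions listed above can be combined to build a plot piece by piece. The sketch below starts from an empty plot and uses made-up data (x and x squared) purely for illustration.

```r
x <- 1:10
y <- x^2

# Set up the coordinate system without drawing points, axes or labels
plot(x, y, type = "n", axes = FALSE, xlab = "", ylab = "")

points(x, y, pch = 19)                          # add the points
lines(x, y, lty = 2)                            # connect them with a dashed line
text(x[5], y[5], labels = "(5, 25)", pos = 4)   # label one point, to its right
axis(1, at = x)                                 # x-axis with a tick at every value
axis(2, las = 1)                                # y-axis with horizontal tick labels
box()                                           # frame around the plot region
title(main = "Built up step by step", xlab = "x", ylab = "x squared")
mtext("Added with mtext()", side = 3, line = 0.3, cex = 0.8)
```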

Add and customize titles
Add text (Text characteristics)
Add and customize legends
Change point shapes
Change line types
Change colors

1. Add Titles

Plot titles can be specified either directly in the plotting function when the plot is created or by using the title() function (to add titles to an existing plot).

We make the plot with the plot() function and then add a title to the top of the plot with the title() function.

library(datasets)

## Make the initial plot
with(airquality, plot(Wind, Ozone))

## Add a title
title(main = "Ozone and Wind in New York City")

Base plot with annotation

# Add titles
barplot(c(2,5), main="Main title",
        xlab="X axis title",
        ylab="Y axis title",
        sub="Sub-title",
        col.main="red", col.lab="blue", col.sub="black")

# Increase the size of titles
barplot(c(2,5), main="Main title",
        xlab="X axis title",
        ylab="Y axis title",
        sub="Sub-title",
        cex.main=2, cex.lab=1.7, cex.sub=1.2)

2. Add Text (Text characteristics)

Graphical parameters are also used to specify text size, font, and style.

#set.seed(1)

# Generate sample data
#x <- rnorm(500)
#y <- x + rnorm(500)

x <- airquality$Ozone
y <- airquality$Wind

plot(x, y, main = "My title", sub = "Subtitle",
     cex.main = 2,   # Title size
     cex.sub = 1.5,  # Subtitle size
     cex.lab = 3,    # X-axis and Y-axis labels size
     cex.axis = 0.5) # Axis tick labels size

The font argument can be set to 1 for plain text (the default), 2 for bold, 3 for italic and 4 for bold italic text. This argument won’t modify the title style.

plot(x, y, font = 2, main = "Bold") # Bold

plot(x, y, font = 3, main = "Italics") # Italics

plot(x, y, font = 4, main = "Bold italics") # Bold italics

You can also specify the style of each of the texts of the plot with the font.main, font.sub, font.axis and font.lab arguments.

plot(x, y,
     main = "My title",
     sub = "Subtitle",
     font.main = 1, # Title font style
     font.sub  = 2, # Subtitle font style
     font.axis = 3, # Axis tick labels font style
     font.lab  = 4) # Font style of X and Y axis labels

The mtext function allows you to add text to any of the four sides of the plot box. There are 12 combinations (3 on each side of the box: left-, center- and right-aligned). You just need to change the side and adj arguments to obtain the combination you need.

The text function, in contrast, allows you to add text or formulas inside the plot at a position given by its coordinates. The following code blocks show some examples for both functions.

mtext does not support arbitrary rotation; it only offers the las argument, for example las = 1 to draw text horizontally in the vertical-axis margins and las = 3 to draw it vertically in the X-axis margins. If you need to rotate the text, you can use the text function with the srt argument instead.

mtext()

line: the margin line on which to place the text. Default value is 0.
adj: the adjustment of the text in the reading direction, from 0 to 1 (default value is 0.5).

plot(x, y, main = "Main title", cex = 2, col = "blue")

#---------------
# mtext function
#---------------

# Bottom-center
mtext("Bottom text", side = 1)

# Left-center
mtext("Left text", side = 2)

# Top-center
mtext("Top text", side = 3)

# Right-center
mtext("Right text", side = 4)


# Bottom-left
mtext("Bottom-left text", side = 1, adj = 0)

# Top-right
mtext("Top-right text", side = 3, adj = 1)


# Top with separation
mtext("Top higher text", side = 3, line = 2.5)

text()

plot(x, y, main = "Main title", cex = 2, col = "blue")
#--------------
# Text function
#--------------

# Add text at coordinates (100, 18), inside the range of the plotted data
text(100, 18, "More text")

# Rotate 45 degrees
text(120, 12, label = "Text annotation",
     srt = 45) # Rotation


# Split the text in several lines
text(150, 6,
     label = "Text\n annotation") # Split text

3. Add Legends

Legends can be added with the legend() function. A simplified format is:

legend(x, y=NULL, legend, col)

x and y: the coordinates to be used for the legend. Keywords can also be used for x: “bottomright”, “bottom”, “bottomleft”, “left”, “topleft”, “top”, “topright”, “right” and “center”.
legend: the text of the legend.
col: colors of the lines and points beside the legend text.

# Generate some data
x<-1:10; y1=x*x; y2=2*y1
# First line plot; ylim covers both series so the second line is not cut off
plot(x, y1, type="b", pch=19, col="red", xlab="x", ylab="y", ylim=range(y1, y2))
# Add a second line
lines(x, y2, pch=18, col="blue", type="b", lty=2)
# Add legends
legend("topleft", legend=c("Line 1", "Line 2"),
       col=c("red", "blue"), lty=1:2, cex=0.8)

4. Change Point Shape

Point symbols can be changed using the argument pch.

The following arguments can be used to change the color and the size of the points:

col: color (code or name) to use for the points
bg: the background (or fill) color for the open plot symbols. It can be used only when pch = 21:25.
cex: the size of pch symbols
lwd: the line width for the plotting symbols

x<-c(2.2, 3, 3.8, 4.5, 7, 8.5, 6.7, 5.5)
y<-c(4, 5.5, 4.5, 9, 11, 15.2, 13.3, 10.5)
# Change plotting symbol using pch
plot(x, y, pch = 19, col="blue")

plot(x, y, pch = 18, col="red")

plot(x, y, pch = 24, cex=2, col="blue", bg="red", lwd=2)
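To compare the available symbols, the sketch below draws pch values 0 to 25 side by side. The colors are arbitrary; bg is only visible on symbols 21 to 25, which illustrates the restriction mentioned above.

```r
pch_codes <- 0:25

# One point per symbol; bg fills only symbols 21 to 25
plot(pch_codes, rep(1, length(pch_codes)),
     pch = pch_codes, col = "blue", bg = "red", cex = 1.5,
     yaxt = "n", xlab = "pch value", ylab = "")
```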

5. Change Line Type

Line types can be changed using the graphical parameter lty. The line type can be specified using either text ("blank", "solid", "dashed", "dotted", "dotdash", "longdash", "twodash") or a number (0, 1, 2, 3, 4, 5, 6). Note that lty = "solid" is identical to lty = 1.

x=1:10; y=x*x
plot(x, y, type="l") # Solid line (by default)

plot(x, y, type="l", lty="dashed")# Use dashed line type

plot(x, y, type="l", lty="dashed", lwd=3)# Change line width
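The correspondence between line type numbers and names can be checked visually. The following sketch draws line types 1 to 6 by number and labels each with its name; the axis limits are chosen only to leave room for the labels.

```r
lty_names <- c("solid", "dashed", "dotted", "dotdash", "longdash", "twodash")

# Empty plot with one horizontal line per line type
plot(c(0, 1), c(1, 6), type = "n", ylim = c(0.5, 6.75),
     xlab = "", ylab = "lty number", las = 1)
for (i in seq_along(lty_names)) {
  abline(h = i, lty = i, lwd = 2)   # line type given by number
}

# Name of each line type, printed just above its line
text(0.5, seq_along(lty_names) + 0.3, labels = lty_names)
```

Line type 0 ("blank") is left out because it draws nothing.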

6. Change Colors

Colors can be specified by name (e.g. col = "red") or by hexadecimal code (e.g. col = "#FFCC00"). This section covers:

Built-in color names in R
Specifying colors by hexadecimal code
Using RColorBrewer palettes

a. Built-in color names in R

# Generate a plot of the color names which R knows about.
#++++++++++++++++++++++++++++++++++++++++++++
# cl: a vector of colors to plot
# bg: background of the plot
# cex: text size
# rot: text rotation angle
# usage: showCols(bg = "gray33")
showCols <- function(cl=colors(), bg = "grey",
                     cex = 0.75, rot = 30) {
    m <- ceiling(sqrt(n <-length(cl)))
    length(cl) <- m*m; cm <- matrix(cl, m)
    require("grid")
    grid.newpage(); vp <- viewport(w = .92, h = .92)
    grid.rect(gp=gpar(fill=bg))
    grid.text(cm, x = col(cm)/m, y = rev(row(cm))/m, rot = rot,
              vp=vp, gp=gpar(cex = cex, col = cm))
  }

To view all the built-in color names which R knows about (n = 657), use the following R code :

showCols(cl= colors(), bg="gray33", rot=30, cex=0.75)

## Loading required package: grid

# The first sixty color names
showCols(bg="gray20",cl=colors()[1:60], rot=30, cex=0.9)

# Barplot using color names
barplot(c(2,5), col=c("chartreuse", "blue4"))

b. Specifying colors by hexadecimal code

Barplot using hexadecimal color code

Source : http://www.visibone.com

barplot(c(2,5), col=c("#009999", "#0000FF"))
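Color names and hexadecimal codes describe the same red, green and blue components, and base R can convert between them. The sketch below uses col2rgb() and rgb() from grDevices; the variable names are chosen for the example.

```r
# Red, green and blue components (0 to 255) of a named color
col2rgb("chartreuse")

# Build a hexadecimal code from its components
hex_code <- rgb(0, 153, 153, maxColorValue = 255)
hex_code   # "#009999", the first color used in the barplot above

# Components of a hexadecimal code
blue_components <- col2rgb("#0000FF")
blue_components
```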

c. Using RColorBrewer palettes

#install.packages("RColorBrewer")
library("RColorBrewer")
display.brewer.all()

There are 3 types of palettes : sequential, diverging, and qualitative.

Sequential palettes are suited to ordered data that progress from low to high (gradient). The palette names are: Blues, BuGn, BuPu, GnBu, Greens, Greys, Oranges, OrRd, PuBu, PuBuGn, PuRd, Purples, RdPu, Reds, YlGn, YlGnBu, YlOrBr, YlOrRd.
Diverging palettes put equal emphasis on mid-range critical values and extremes at both ends of the data range. The diverging palettes are: BrBG, PiYG, PRGn, PuOr, RdBu, RdGy, RdYlBu, RdYlGn, Spectral.
Qualitative palettes are best suited to representing nominal or categorical data. They do not imply magnitude differences between groups. The palette names are: Accent, Dark2, Paired, Pastel1, Pastel2, Set1, Set2, Set3.
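A palette is turned into a vector of colors with brewer.pal(n, name), which can then be passed to col. The sketch below is guarded so that it only runs when the RColorBrewer package is installed; the choice of "Blues" and "Set2" and of five colors is arbitrary.

```r
if (requireNamespace("RColorBrewer", quietly = TRUE)) {
  # Five colors from a sequential and from a qualitative palette
  seq_cols  <- RColorBrewer::brewer.pal(5, "Blues")
  qual_cols <- RColorBrewer::brewer.pal(5, "Set2")

  old_par <- par(mfrow = c(1, 2))
  barplot(1:5, col = seq_cols, main = "Sequential: Blues")        # ordered values
  barplot(rep(3, 5), col = qual_cols, main = "Qualitative: Set2") # unordered groups
  par(old_par)
}
```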

d. Colors can be specified by name (e.g. col = "red") or by hexadecimal code (e.g. col = "#FFCC00").

# use color names
barplot(c(2,5), col=c("blue", "red"))

# use hexadecimal color code
barplot(c(2,5), col=c("#009999", "#0000FF"))

C. A Few Different Plot Types

library(insuranceData)
data(dataCar)

1. Pie Chart and why it should be avoided

Pie charts represent non-negative numerical data vectors in the form of a circular “pie” with one “slice” for each element of the vector, whose size is proportional to its relative value.
Pie charts are a very bad way of displaying information. The eye is good at judging linear measures and bad at judging relative areas. A bar chart or dot chart is a preferable way of displaying this type of data.
Pie charts are best used when precision isn’t particularly important, and when there are relatively few wedges to compare (or few that matter).

xTab <- table(dataCar$veh_body)
par(mfrow=c(1,2))
pie(xTab, col=rainbow(length(xTab))) # one color per vehicle body type
barplot(xTab)

par(mfrow=c(1,1))

Clearly, the bar chart on the right summarizes the relative frequencies of the vehicle body types much more effectively here than the pie chart on the left. One reason is that many of the labels on the pie chart overlap badly enough that they cannot be read at all, but even ignoring this difficulty, the bar chart on the right gives us a much clearer picture of the magnitude of the differences in the relative frequencies of the different vehicle types in the dataset.

Recommendation: Use a pie chart only if “we have to”: a boss or a customer insists, or pie charts are required in keeping with the format of a report or other document that includes our analysis results.
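The dot chart mentioned above as an alternative can be drawn with dotchart(). To keep the sketch self-contained it uses the Type variable of the Cars93 data from MASS rather than dataCar, which is a substitution made for this example; the same call pattern would apply to xTab.

```r
library(MASS)

# Frequency of each car type, sorted so the dots form an ordered profile
type_counts <- sort(table(Cars93$Type))

# dotchart() expects a plain numeric vector, so the table is converted explicitly
dotchart(as.numeric(type_counts), labels = names(type_counts),
         pch = 19, xlab = "Number of cars", main = "Cars93 by type")
```

Each category is read off a common horizontal scale, which is the kind of linear judgment the eye handles well.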

2. Bar Plots

Bar charts can be effective in displaying integer-valued numerical data.
The barplot() function is extremely flexible, capable of generating both vertical bar charts and horizontal bar charts, along with other variations like the stacked bar chart.

barplot(sort(xTab), cex.names = 0.7, las = 1, horiz = TRUE)
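The stacked variation arises when barplot() is given a two-way table instead of a vector. The sketch below uses the Cars93 data from MASS (an assumption made to keep the example self-contained) and shows the stacked form next to the side-by-side form obtained with beside = TRUE.

```r
library(MASS)

# Two-way table: origin (rows) by car type (columns)
counts <- table(Cars93$Origin, Cars93$Type)
bar_cols <- c("grey30", "grey80")

old_par <- par(mfrow = c(1, 2))
barplot(counts, las = 2, col = bar_cols,
        legend.text = TRUE, main = "Stacked")          # one bar per type, split by origin
barplot(counts, las = 2, col = bar_cols,
        beside = TRUE, main = "Side by side")          # one bar per origin within each type
par(old_par)
```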

D. Multiple Plots

R makes it easy to combine multiple plots into one overall graph, using either the par() or layout() function.

1. par() function

With the par() function, you can include the option mfrow=c(nrows, ncols) to create a matrix of nrows x ncols plots that are filled in by row. mfcol=c(nrows, ncols) fills in the matrix by columns.

library(MASS)
data(Cars93)
par(mfrow=c(1,2)) # 2 figures arranged in 1 row and 2 columns
plot(Cars93$Horsepower, Cars93$MPG.city, ylim = c(15, 50)) # ylim specifies the upper and lower limit of the y-axis
title("Plot no. 1")
plot(Cars93$Horsepower, Cars93$MPG.highway, ylim = c(15, 50))
title("Plot no. 2")

par(mfrow=c(1,1))
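The difference between mfrow and mfcol is only the order in which the panels are filled. The sketch below makes that order visible by writing the plot number into each panel; show_fill_order is a hypothetical helper defined here for illustration.

```r
# Hypothetical helper: draw four empty panels numbered in drawing order
show_fill_order <- function(...) {
  old_par <- par(..., mar = c(1, 1, 2, 1))
  on.exit(par(old_par))          # restore the previous layout and margins
  for (i in 1:4) {
    plot.new()                   # start the next panel
    box()
    text(0.5, 0.5, labels = i, cex = 3)
  }
}

show_fill_order(mfrow = c(2, 2))   # panels 1 and 2 in the top row
show_fill_order(mfcol = c(2, 2))   # panels 1 and 2 in the left column
```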

par(mfrow=c(2,2))
plot(Cars93$Cylinders, Cars93$MPG.city, las = 2, ylim = c(15, 50))
title("MPG.city vs. Cylinders")
plot(Cars93$Cylinders, Cars93$MPG.highway, las = 2, ylim = c(15, 50))
title("MPG.highway vs. Cylinders")
plot(Cars93$MPG.city, Cars93$MPG.highway, xlim = c(15, 50),
ylim = c(15, 50))
title("MPG.highway vs. MPG.city")
abline(a = 0, b = 1, lty = 2, lwd = 2)
delta <- Cars93$MPG.highway - Cars93$MPG.city
plot(Cars93$Cylinders, delta, las = 2)
title("Mileage difference vs. Cylinders")

One advantage of the 2 × 2 plot array is that it creates an array of four square plots, all the same size and typically large enough to see useful details.

The upper two plots show boxplots of MPG.city and MPG.highway vs Cylinders, illustrating that both of these mileage values generally decline as the number of cylinders increases, with the rotary engine behaving essentially the same as the 8-cylinder engines. The lower left plot shows the highway mileage versus the city mileage, with an equality reference line to emphasize that the highway mileage is always greater than the city mileage. Finally, the lower right plot is a boxplot summary of the difference between these mileages (highway mileage minus city mileage) versus the number of cylinders.

2. layout() function

The layout() function has the form layout(mat), where mat is a matrix object specifying the location of the N figures to plot.
It is a little more complicated than the mfrow parameter, but it gives greater flexibility in specifying the sizes, shapes, and positions of the plots.

layout.matrix <- matrix(c(2, 1, 0, 3), nrow = 2, ncol = 2)

layout(mat = layout.matrix,
       heights = c(1, 2), # Heights of the two rows
       widths = c(2, 2)) # Widths of the two columns
layout.show(3)
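It can help to print the layout matrix to see how its entries map to positions on the page. The sketch below repeats the matrix used above; matrix() fills column by column, each number is the order in which a plot is drawn, and 0 marks a cell that stays empty. The name n_fig is chosen for the example.

```r
layout.matrix <- matrix(c(2, 1, 0, 3), nrow = 2, ncol = 2)
print(layout.matrix)
#      [,1] [,2]
# [1,]    2    0
# [2,]    1    3

# layout() returns the number of figures it has set up
n_fig <- layout(mat = layout.matrix, heights = c(1, 2), widths = c(2, 2))
n_fig

layout(1)   # back to a single plot per page
```

So the first plot goes to the bottom left, the second to the top left, the third to the bottom right, and the top right cell is left blank.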

# Set plot layout
layout(mat = matrix(c(2, 1, 0, 3), 
                        nrow = 2, 
                        ncol = 2),
       heights = c(1, 2),    # Heights of the two rows
       widths = c(2, 1))     # Widths of the two columns

# Plot 1: Scatterplot
# mar: a numeric vector of length 4, which sets the margin sizes in the following order: bottom, left, top, and right. The default is c(5.1, 4.1, 4.1, 2.1).

par(mar = c(5, 4, 2, 0)) 
plot(x = Cars93$Horsepower,
     y = Cars93$MPG.city,
     xlab = "Horsepower", 
     ylab = "City Mileage")

# Plot 2: Top boxplot (city mileage by cylinders)
par(mar = c(4, 4, 0, 0))
plot(Cars93$Cylinders, Cars93$MPG.city, las = 2, ylim = c(15, 50))

# Plot 3: Right boxplot (highway mileage by cylinders)
par(mar = c(4, 3, 0, 0))
plot(Cars93$Cylinders, Cars93$MPG.highway, las = 2, ylim = c(15, 50))

library(MASS)
attach(UScereal)
str(UScereal)

## 'data.frame':    65 obs. of  11 variables:
##  $ mfr      : Factor w/ 6 levels "G","K","N","P",..: 3 2 2 1 2 1 6 4 5 1 ...
##  $ calories : num  212 212 100 147 110 ...
##  $ protein  : num  12.12 12.12 8 2.67 2 ...
##  $ fat      : num  3.03 3.03 0 2.67 0 ...
##  $ sodium   : num  394 788 280 240 125 ...
##  $ fibre    : num  30.3 27.3 28 2 1 ...
##  $ carbo    : num  15.2 21.2 16 14 11 ...
##  $ sugars   : num  18.2 15.2 0 13.3 14 ...
##  $ shelf    : int  3 3 3 1 2 3 1 3 2 1 ...
##  $ potassium: num  848.5 969.7 660 93.3 30 ...
##  $ vitamins : Factor w/ 3 levels "100%","enriched",..: 2 2 2 2 2 2 2 2 2 2 ...

LET’S DO SOME PRACTICE

The UScereal data frame from the MASS package characterizes 65 breakfast cereals sold in the U.S., from information on their FDA-mandated labels. Three of the variables included in this data frame are the calories per serving (calories), the grams of fat per serving (fat), and a one-character manufacturer designation mfr.

Using the matrix function, construct the 2 × 2 matrix layoutMatrix with plot designations 1 and 2 in the first row, and 3 and 3 in the second, giving a single wide bottom plot. Display layoutMatrix and use the layout function to set up the plot array.

In the upper right position of the array, create a plot using the hist function with the protein attribute. Set the breaks of your bins with c(0, 5, 10, 15, 20, 25). Specify the x label “Grams per serving” and the title “Protein Content of Cereal”.
In the upper left position of the array, using the plot function with UScereal$shelf converted to a factor, generate a plot that attempts to show the relationship between ‘mfr’ and the converted ‘shelf’. Specify the x- and y-axis labels “mfr” and “shelf” and give the plot the title “shelf as factor”.
In the bottom plot, put a scatterplot of fat vs calories with the point symbol ‘filled triangle point-up’ in red. Add an equality reference line with the abline function, specifying an intercept of 0, a slope of 1, and a dashed line type. Specify an appropriate title.

References

Paul Murrell (2011). R Graphics, CRC Press.

Hadley Wickham (2009). ggplot2, Springer.

Deepayan Sarkar (2008). Lattice: Multivariate Data Visualization with R, Springer.

https://r-coder.com/

https://bookdown.org/ndphillips/YaRrr/arranging-plots-with-parmfrow-and-layout.html

Summary of Lab Session 1-2

Belinda Mutiara

2022-03-21