ggplot2

Beware! This is just a cheat sheet for ggplot2, written with myself as the only intended reader.

ggplot2 is a package for high-level data visualisation. It is more of a grammar for plots!

Let's start with the qplot function.

1. qplot

library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.2.5

ggplot2 ships with a data frame named mpg. Let's look at its structure.

str(mpg)
## Classes 'tbl_df', 'tbl' and 'data.frame':    234 obs. of  11 variables:
##  $ manufacturer: chr  "audi" "audi" "audi" "audi" ...
##  $ model       : chr  "a4" "a4" "a4" "a4" ...
##  $ displ       : num  1.8 1.8 2 2 2.8 2.8 3.1 1.8 1.8 2 ...
##  $ year        : int  1999 1999 2008 2008 1999 1999 2008 1999 1999 2008 ...
##  $ cyl         : int  4 4 4 4 6 6 6 4 4 4 ...
##  $ trans       : chr  "auto(l5)" "manual(m5)" "manual(m6)" "auto(av)" ...
##  $ drv         : chr  "f" "f" "f" "f" ...
##  $ cty         : int  18 21 20 21 16 18 18 18 16 20 ...
##  $ hwy         : int  29 29 31 30 26 26 27 26 25 28 ...
##  $ fl          : chr  "p" "p" "p" "p" ...
##  $ class       : chr  "compact" "compact" "compact" "compact" ...

Making a simple plot

Now let's plot hwy against displ. The first argument (hwy) goes on the x-axis and the second (displ) on the y-axis.

qplot(x,y,data=DataFrame)

qplot(hwy,displ,data=mpg)
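
For comparison, here is a minimal sketch of the same scatter plot written with the full ggplot() interface that qplot wraps. Recent ggplot2 releases mark qplot as deprecated, so this longer form may be the safer one to remember. The object name p_scatter is just a placeholder chosen for this note.

```r
library(ggplot2)

# Same plot as qplot(hwy, displ, data = mpg):
# hwy is mapped to the x-axis, displ to the y-axis.
p_scatter <- ggplot(mpg, aes(x = hwy, y = displ)) +
  geom_point()

# Printing the object draws the plot.
print(p_scatter)
```

Storing the plot in a variable makes it easy to add more layers later with +.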

Adding a third Attribute as color

A third attribute can be shown as a color breakdown. Here we add drv to the chart, so each point is colored by its drv value.

qplot(x,y,data=DataFrame,color=z)

qplot(hwy,displ,data=mpg,color=drv)
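
A small sketch of the difference between mapping color to a variable and setting one fixed color, written with ggplot() so the distinction is explicit. The names p_mapped and p_fixed are placeholders for this note.

```r
library(ggplot2)

# Mapped color: inside aes(), so each level of drv gets its own color
# and a legend is drawn automatically.
p_mapped <- ggplot(mpg, aes(x = hwy, y = displ, color = drv)) +
  geom_point()

# Fixed color: outside aes(), so every point is blue and there is no legend.
p_fixed <- ggplot(mpg, aes(x = hwy, y = displ)) +
  geom_point(color = "blue")

# The levels of drv that receive a color in p_mapped.
drv_levels <- sort(unique(mpg$drv))
drv_levels
```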

Adding a trend line

Adding a geom

A geom adds a layer to the plot. A smoothing geom adds a summary statistic that shows the overall trend.

So here we will add 2 geoms.

  1. “point” - So that we can plot the points
  2. “smooth” - So that we can add a trend line with a shaded band showing the 95% confidence interval of the line (the ggplot2 default). By default, for small datasets like this one, smooth uses the LOESS regression algorithm to fit the trend line. (https://en.wikipedia.org/wiki/Local_regression)

qplot(x,y,data=DataFrame,geom=c("point","smooth"))

qplot(hwy,displ,data=mpg,geom=c("point","smooth"))
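
The two geoms can also be written as explicit layers. This is a sketch, not the only way to do it: it spells out the smoothing method and the confidence level rather than relying on defaults, and level = 0.95 is the value geom_smooth uses when nothing is given.

```r
library(ggplot2)

p_trend <- ggplot(mpg, aes(x = hwy, y = displ)) +
  geom_point() +                      # layer 1: the raw points
  geom_smooth(method = "loess",       # layer 2: LOESS trend line
              formula = y ~ x,
              level = 0.95)           # width of the shaded confidence band

print(p_trend)
```

Lowering level (for example to 0.90) narrows the shaded band. Setting se = FALSE removes it.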

Making a histogram

Let's make a histogram of the hwy attribute, with the drv breakdown shown as fill colors.

qplot(x,data=DataFrame,fill=y)

qplot(hwy,data=mpg,fill=drv)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
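
The message above appears because no bin width was chosen. As a rough sketch, the snippet below estimates how wide each bar is when the hwy range is cut into 30 bins, then draws the same histogram with an explicit width. The estimate is only approximate, since ggplot2 places its bin edges slightly differently.

```r
library(ggplot2)

# Approximate bar width when the range of hwy is split into 30 bins.
approx_width <- diff(range(mpg$hwy)) / 30
approx_width

# Choosing binwidth explicitly avoids the stat_bin() message.
p_hist <- ggplot(mpg, aes(x = hwy, fill = drv)) +
  geom_histogram(binwidth = 2)

print(p_hist)
```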

Creating Facets (Making Panels)

With facets we can create separate plots, each showing the subset of the data for one value of an attribute.

Here, let's separate the plots by the drv attribute.

qplot(x,y,data=DataFrame,facets=.~z)

qplot(hwy,displ,data=mpg,facets=.~drv)

Let's create a histogram with facets.

qplot(x,data=DataFrame,facets=.~z,binwidth=n)

binwidth is simply the width of the value range covered by each histogram bar, in the units of x.

qplot(hwy,data=mpg,facets=.~drv,binwidth=2)

We can even add color differentiation.

qplot(hwy,data=mpg,fill=drv,facets=.~drv,binwidth=2)

This plot was broken down into columns. To break it down into rows instead, just swap “.” and drv, while “~” remains in its place.

qplot(hwy,data=mpg,fill=drv,facets=drv~.,binwidth=2)
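
The same row and column layouts can be sketched with facet_grid(), which takes the same rowvar ~ colvar formula. The names p_base, p_cols and p_rows are placeholders for this note.

```r
library(ggplot2)

# Shared base plot: histogram of hwy, filled by drv.
p_base <- ggplot(mpg, aes(x = hwy, fill = drv)) +
  geom_histogram(binwidth = 2)

# . ~ drv : one panel per drv value, laid out as columns.
p_cols <- p_base + facet_grid(. ~ drv)

# drv ~ . : one panel per drv value, laid out as rows.
p_rows <- p_base + facet_grid(drv ~ .)

print(p_cols)
print(p_rows)
```

The row layout is often easier to read for histograms, because all panels share one x-axis and the distributions line up vertically.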

Let's work with another dataset. It comes preloaded in R and is called iris.

str(iris)
## 'data.frame':    150 obs. of  5 variables:
##  $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
##  $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
##  $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
##  $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
##  $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...

Making a logarithmic plot

Just change the attribute name to log(attribute)

qplot(log(Petal.Length),data=iris,fill=Species,binwidth=0.07)
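
Note that log() is the natural logarithm, so the axis above is in log units. As an alternative sketch, the data can be left untouched and the axis put on a log10 scale, which keeps the tick labels in the original units. The bin width is then measured in log10 units, so 0.07 is divided by log(10) to keep the bars roughly comparable.

```r
library(ggplot2)

# Range of the transformed values that qplot(log(Petal.Length), ...) bins.
log_range <- range(log(iris$Petal.Length))
log_range

# Alternative: transform the axis instead of the data.
p_log <- ggplot(iris, aes(x = Petal.Length, fill = Species)) +
  geom_histogram(binwidth = 0.07 / log(10)) +  # binwidth is in log10 units here
  scale_x_log10()

print(p_log)
```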

Making a density plot

Just change geom value to “density”

qplot(Petal.Length,data=iris,color=Species,geom="density")

Making it more appealing

qplot(Petal.Length,data=iris,color=Species,fill=Species,geom="density")

Here the densities overlap, so the ones drawn behind are hidden. Let's make the fill a little transparent.

We will use alpha to do it, with 0 being completely transparent and 1 being completely opaque.

qplot(Petal.Length,data=iris,color=Species,fill=Species,alpha=I(5/10),geom="density")
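
The I() wrapper matters: qplot treats its arguments as aesthetic mappings, and I() tells it to use 0.5 as a constant instead. In the ggplot() form, the same effect comes from setting alpha outside aes(), as in this sketch.

```r
library(ggplot2)

# color and fill are mapped to Species (inside aes()),
# alpha is a fixed setting for the whole layer (outside aes()).
p_density <- ggplot(iris, aes(x = Petal.Length, color = Species, fill = Species)) +
  geom_density(alpha = 0.5)

print(p_density)
```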

Here is the complete qplot definition

qplot(x, y, data=, color=, shape=, size=, fill=, alpha=, geom=, method=, formula=, facets=, xlim=, ylim=, xlab=, ylab=, main=, sub=)

1. alpha - Alpha transparency for overlapping elements expressed as a fraction between 0 (complete transparency) and 1 (complete opacity)

2. color, shape, size, fill - Associates the levels of variable with symbol color, shape, or size. For line plots, color associates levels of a variable with line color. For density and box plots, fill associates fill colors with a variable. Legends are drawn automatically.

3. data - Specifies a data frame

4. facets - Creates a trellis graph by specifying conditioning variables. Its value is expressed as rowvar ~ colvar. To create trellis graphs based on a single conditioning variable, use rowvar~. or .~colvar

5. geom - Specifies the geometric objects that define the graph type. The geom option is expressed as a character vector with one or more entries. geom values include “point”, “smooth”, “boxplot”, “line”, “histogram”, “density”, “bar”, and “jitter”.

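6. main, sub - Character vectors specifying the title and subtitle. Note that the arguments documented for qplot in ggplot2 include main but not sub, so sub may be ignored. In ggplot2 versions that support it, a subtitle can be added with + labs(subtitle = "...").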

7. method, formula - If geom="smooth", a loess fit line and confidence limits are added by default. When the number of observations is greater than 1,000, a more efficient smoothing algorithm is employed. Methods include "lm" for regression, "gam" for generalized additive models, and "rlm" for robust regression. The formula parameter gives the form of the fit.

For example, to add simple linear regression lines, you'd specify geom="smooth", method="lm", formula=y~x. Changing the formula to y~poly(x,2) would produce a quadratic fit. Note that the formula uses the letters x and y, not the names of the variables.

For method="gam", be sure to load the mgcv package. For method="rlm", load the MASS package.

8. x, y - Specifies the variables placed on the horizontal and vertical axis. For univariate plots (for example, histograms), omit y

9. xlab, ylab - Character vectors specifying horizontal and vertical axis labels

10. xlim,ylim - Two-element numeric vectors giving the minimum and maximum values for the horizontal and vertical axes, respectively
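
To tie several of these options together (method, formula, labels and limits), here is a sketch written with ggplot() layers. It draws a linear fit and a quadratic fit on the same scatter plot. The title, axis labels and x limits are choices made for this example, not values from the notes above.

```r
library(ggplot2)

p_fit <- ggplot(mpg, aes(x = hwy, y = displ)) +
  geom_point(alpha = 0.5) +
  # Straight-line fit: the formula uses x and y, not hwy and displ.
  geom_smooth(method = "lm", formula = y ~ x) +
  # Quadratic fit, drawn in red to tell the two lines apart.
  geom_smooth(method = "lm", formula = y ~ poly(x, 2), color = "red") +
  labs(title = "Displacement vs highway mileage",
       x = "Highway miles per gallon",
       y = "Engine displacement (litres)") +
  xlim(10, 45)  # wide enough to keep every hwy value in mpg

print(p_fit)
```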