***********************************************************************************************************

This code is a comprehensive run throigh for building visualizations using the ggplot2 library

Submitted By : Sumeet Kukreja

emailid : skukre3@uic.edu | sumeetkukreja92@gmail.com | linkedin.com/sumeetkukreja

***********************************************************************************************************

******************************************************

Understanding the Syntax

******************************************************

setup

The most important thing to know while working with ggplot2 is that is only works with a dataframe and not #lists. Also, you can add more layers and themes on an existing ggplot to enhance it accordingly

options(scipen=999)  # turn off scientific notation like 2e+08
library(ggplot2)
midwest = data("midwest", package = "ggplot2")  # load the data
midwest <- read.csv("http://goo.gl/G1K41K") # alt source 
head(midwest)
##   PID    county state  area poptotal popdensity popwhite popblack
## 1 561     ADAMS    IL 0.052    66090  1270.9615    63917     1702
## 2 562 ALEXANDER    IL 0.014    10626   759.0000     7054     3496
## 3 563      BOND    IL 0.022    14991   681.4091    14477      429
## 4 564     BOONE    IL 0.017    30806  1812.1176    29344      127
## 5 565     BROWN    IL 0.018     5836   324.2222     5264      547
## 6 566    BUREAU    IL 0.050    35688   713.7600    35157       50
##   popamerindian popasian popother percwhite  percblack percamerindan
## 1            98      249      124  96.71206  2.5752761     0.1482826
## 2            19       48        9  66.38434 32.9004329     0.1788067
## 3            35       16       34  96.57128  2.8617170     0.2334734
## 4            46      150     1139  95.25417  0.4122574     0.1493216
## 5            14        5        6  90.19877  9.3728581     0.2398903
## 6            65      195      221  98.51210  0.1401031     0.1821340
##    percasian  percother popadults  perchsd percollege percprof
## 1 0.37675897 0.18762294     43298 75.10740   19.63139 4.355859
## 2 0.45172219 0.08469791      6724 59.72635   11.24331 2.870315
## 3 0.10673071 0.22680275      9669 69.33499   17.03382 4.488572
## 4 0.48691813 3.69733169     19272 75.47219   17.27895 4.197800
## 5 0.08567512 0.10281014      3979 68.86152   14.47600 3.367680
## 6 0.54640215 0.61925577     23444 76.62941   18.90462 3.275891
##   poppovertyknown percpovertyknown percbelowpoverty percchildbelowpovert
## 1           63628         96.27478        13.151443             18.01172
## 2           10529         99.08714        32.244278             45.82651
## 3           14235         94.95697        12.068844             14.03606
## 4           30337         98.47757         7.209019             11.17954
## 5            4815         82.50514        13.520249             13.02289
## 6           35107         98.37200        10.399635             14.15882
##   percadultpoverty percelderlypoverty inmetro category
## 1        11.009776          12.443812       0      AAR
## 2        27.385647          25.228976       0      LHR
## 3        10.852090          12.697410       0      AAR
## 4         5.536013           6.217047       1      ALU
## 5        11.143211          19.200000       0      AAR
## 6         8.179287          11.008586       0      AAR
# Initial Ggplot
ggplot(midwest, aes(x=area, y=poptotal))  # area and poptotal are columns in 'midwest'

There is no data points becuase I did not specify if I want a scatter plot or a line chart. Any information that is part of the source dataframe has to be specified inside the aes() function.

******************************************************

Building your first plot

******************************************************

Lets make a simple ggplot

library(ggplot2)
ggplot(midwest, aes(x=area, y=poptotal)) + geom_point()

Each point on the scatterplot represents a county.geom_point() simply mean creating a scatter plot. You can also create geom_jitter(), geom_count(), or geom_bin2d().

To add a line of best fit, you can use geom_smoooth() as shown below:

library(ggplot2)
g <- ggplot(midwest, aes(x=area, y=poptotal)) + geom_point() + geom_smooth(method="lm")  # set se=FALSE to turnoff confidence bands
plot(g)

Here method = ‘lm’ simply means draw the line of best fir for a linear model.

******************************************************

Changing limits of axes

******************************************************

The X and Y axis limits can be controlled in 2 ways. 1. By Excluding the points out of the range. This is done by using xlim() and ylim(). 2. By zooming into the region of interest without deleting the outliers. This is done by using coord_cartesian().

library(ggplot2)
g <- ggplot(midwest, aes(x=area, y=poptotal)) + geom_point() + geom_smooth(method="lm")  # set se=FALSE to turnoff confidence bands

# Delete the points outside the limits
g + xlim(c(0, 0.1)) + ylim(c(0, 1000000))   # deletes points
## Warning: Removed 5 rows containing non-finite values (stat_smooth).
## Warning: Removed 5 rows containing missing values (geom_point).

# g + xlim(0, 0.1) + ylim(0, 1000000)   # deletes points

Notice that the line of best fit became more horizontal compared to the original plot. This is because, when using xlim() and ylim(), the points outside the specified range are deleted and will not be considered while drawing the line of best fit (using geom_smooth(method=‘lm’)). This feature might come in handy when you wish to know how the line of best fit would change when some extreme values (or outliers) are removed.

library(ggplot2)
g <- ggplot(midwest, aes(x=area, y=poptotal)) + geom_point() + geom_smooth(method="lm")  # set se=FALSE to turnoff confidence bands

# Zoom in without deleting the points outside the limits. 
# As a result, the line of best fit is the same as the original plot.
g1 <- g + coord_cartesian(xlim=c(0,0.1), ylim=c(0, 1000000))  # zooms in
plot(g1)

Since all points were considered, the line of best fit did not change.

******************************************************

Changing the Title and Axis Labels

******************************************************

Again, this can be done by either using the labs() function with title, x and y arguments or by simple using ggtitle(), xlab() and ylab().

library(ggplot2)
g <- ggplot(midwest, aes(x=area, y=poptotal)) + geom_point() + geom_smooth(method="lm")  # set se=FALSE to turnoff confidence bands

g1 <- g + coord_cartesian(xlim=c(0,0.1), ylim=c(0, 1000000))  # zooms in

# Add Title and Labels
g1 + labs(title="Area Vs Population", subtitle="From midwest dataset", y="Population", x="Area", caption="Midwest Demographics")

# or

g1 + ggtitle("Area Vs Population", subtitle="From midwest dataset") + xlab("Area") + ylab("Population")

All the features can be added in a single ggplot() function call as shown below:

# Full Plot call
library(ggplot2)
ggplot(midwest, aes(x=area, y=poptotal)) + 
  geom_point() + 
  geom_smooth(method="lm") + 
  coord_cartesian(xlim=c(0,0.1), ylim=c(0, 1000000)) + 
  labs(title="Area Vs Population", subtitle="From midwest dataset", y="Population", x="Area", caption="Midwest Demographics")

******************************************************

Change the Color and Size of Points

******************************************************

We can change the aesthetics of a geom layer by modifying the respective geoms. Let’s change the color of the points and the line to a static value.

library(ggplot2)
ggplot(midwest, aes(x=area, y=poptotal)) + 
  geom_point(col="red", size=3) +   # Set static color and size for points
  geom_smooth(method="lm", col="firebrick") +  # change the color of line
  coord_cartesian(xlim=c(0, 0.1), ylim=c(0, 1000000)) + 
  labs(title="Area Vs Population", subtitle="From midwest dataset", y="Population", x="Area", caption="Midwest Demographics")

if we want the color to change based on another column in the source dataset, it must be specified inside the aes() function.

library(ggplot2)
gg <- ggplot(midwest, aes(x=area, y=poptotal)) + 
  geom_point(aes(col=state), size=3) +  # Set color to vary based on state categories.
  geom_smooth(method="lm", col="firebrick", size=2) + 
  coord_cartesian(xlim=c(0, 0.1), ylim=c(0, 1000000)) + 
  labs(title="Area Vs Population", subtitle="From midwest dataset", y="Population", x="Area", caption="Midwest Demographics")
plot(gg)

As an added benefit, the legend is added automatically. If needed, it can be removed by setting the legend.position to None from within a theme() function.

gg + theme(legend.position="None")  # remove legend

gg + scale_colour_brewer(palette = "Set1")  # change color palette

More of such palettes can be found in the RColorBrewer package

library(RColorBrewer)
head(brewer.pal.info, 10)  # show 10 palettes
##          maxcolors category colorblind
## BrBG            11      div       TRUE
## PiYG            11      div       TRUE
## PRGn            11      div       TRUE
## PuOr            11      div       TRUE
## RdBu            11      div       TRUE
## RdGy            11      div      FALSE
## RdYlBu          11      div       TRUE
## RdYlGn          11      div      FALSE
## Spectral        11      div      FALSE
## Accent           8     qual      FALSE

******************************************************

Changing the X Axis Texts and Ticks Location

******************************************************

Let’s see how to change the X and Y axis text and its location This can be done in 2 steps 1. Set the breaks The breaks should be of the same scale as the X axis variable. Note that I am using scale_x_continuous because, the X axis variable is a continuous variable. Had it been a date variable, scale_x_date could be used. Like scale_x_continuous() an equivalent scale_y_continuous() is available for Y axis.

library(ggplot2)

# Base plot
gg <- ggplot(midwest, aes(x=area, y=poptotal)) + 
  geom_point(aes(col=state), size=3) +  # Set color to vary based on state categories.
  geom_smooth(method="lm", col="firebrick", size=2) + 
  coord_cartesian(xlim=c(0, 0.1), ylim=c(0, 1000000)) + 
  labs(title="Area Vs Population", subtitle="From midwest dataset", y="Population", x="Area", caption="Midwest Demographics")

# Change breaks
gg + scale_x_continuous(breaks=seq(0, 0.1, 0.01))

2: Change the labels You can optionally change the labels at the axis ticks. labels take a vector of the same length as breaks. Let me demonstrate by setting the labels to alphabets from a to k (though there is no meaning to it in this context).

library(ggplot2)

# Base Plot
gg <- ggplot(midwest, aes(x=area, y=poptotal)) + 
  geom_point(aes(col=state), size=3) +  # Set color to vary based on state categories.
  geom_smooth(method="lm", col="firebrick", size=2) + 
  coord_cartesian(xlim=c(0, 0.1), ylim=c(0, 1000000)) + 
  labs(title="Area Vs Population", subtitle="From midwest dataset", y="Population", x="Area", caption="Midwest Demographics")

# Change breaks + label
gg + scale_x_continuous(breaks=seq(0, 0.1, 0.01), labels = letters[1:11])

If you need to reverse the scale, use scale_x_reverse().

gg <- ggplot(midwest, aes(x=area, y=poptotal)) + 
  geom_point(aes(col=state), size=3) +  # Set color to vary based on state categories.
  geom_smooth(method="lm", col="firebrick", size=2) + 
  coord_cartesian(xlim=c(0, 0.1), ylim=c(0, 1000000)) + 
  labs(title="Area Vs Population", subtitle="From midwest dataset", y="Population", x="Area", caption="Midwest Demographics")

# Reverse X Axis Scale
gg + scale_x_reverse()

To write customized texts for axis labels, I have used 2 methods Method 1: Using sprintf(). (Have formatted it as % in below example) Method 2: Using a custom user defined function. (Formatted 1000’s to 1K scale) Use whichever method feels convenient.

# Base Plot
gg <- ggplot(midwest, aes(x=area, y=poptotal)) + 
  geom_point(aes(col=state), size=3) +  # Set color to vary based on state categories.
  geom_smooth(method="lm", col="firebrick", size=2) + 
  coord_cartesian(xlim=c(0, 0.1), ylim=c(0, 1000000)) + 
  labs(title="Area Vs Population", subtitle="From midwest dataset", y="Population", x="Area", caption="Midwest Demographics")

# Change Axis Texts
gg + scale_x_continuous(breaks=seq(0, 0.1, 0.01), labels = sprintf("%1.2f%%", seq(0, 0.1, 0.01))) + 
  scale_y_continuous(breaks=seq(0, 1000000, 200000), labels = function(x){paste0(x/1000, 'K')})

To Customize the entire theme in one shot using Pre-Built Themes, I have used 2 different methods. Method 1: Use the theme_set() to set the theme before drawing the ggplot. Note that this setting will affect all future plots. Method 2: Draw the ggplot and then add the overall theme setting (eg. theme_bw())

# Base plot
gg <- ggplot(midwest, aes(x=area, y=poptotal)) + 
  geom_point(aes(col=state), size=3) +  # Set color to vary based on state categories.
  geom_smooth(method="lm", col="firebrick", size=2) + 
  coord_cartesian(xlim=c(0, 0.1), ylim=c(0, 1000000)) + 
  labs(title="Area Vs Population", subtitle="From midwest dataset", y="Population", x="Area", caption="Midwest Demographics")

gg <- gg + scale_x_continuous(breaks=seq(0, 0.1, 0.01))

# method 1: Using theme_set()
theme_set(theme_classic())  # not run
gg

# method 2: Adding theme Layer itself.
gg + theme_bw() + labs(subtitle="BW Theme")

gg + theme_classic() + labs(subtitle="Classic Theme")

More to come.. +Creating advanced visualizations : modifying theme components, manipulating legend, annotations, faceting and custom layouts.

***********************************************************************************************************