options(scipen=999) # turn off scientific notation like 2e+08
library(ggplot2)
data("midwest", package = "ggplot2") # load the data (data() creates the 'midwest' object itself; do not assign its return value)
midwest <- read.csv("http://goo.gl/G1K41K") # alt source
head(midwest)
## PID county state area poptotal popdensity popwhite popblack
## 1 561 ADAMS IL 0.052 66090 1270.9615 63917 1702
## 2 562 ALEXANDER IL 0.014 10626 759.0000 7054 3496
## 3 563 BOND IL 0.022 14991 681.4091 14477 429
## 4 564 BOONE IL 0.017 30806 1812.1176 29344 127
## 5 565 BROWN IL 0.018 5836 324.2222 5264 547
## 6 566 BUREAU IL 0.050 35688 713.7600 35157 50
## popamerindian popasian popother percwhite percblack percamerindan
## 1 98 249 124 96.71206 2.5752761 0.1482826
## 2 19 48 9 66.38434 32.9004329 0.1788067
## 3 35 16 34 96.57128 2.8617170 0.2334734
## 4 46 150 1139 95.25417 0.4122574 0.1493216
## 5 14 5 6 90.19877 9.3728581 0.2398903
## 6 65 195 221 98.51210 0.1401031 0.1821340
## percasian percother popadults perchsd percollege percprof
## 1 0.37675897 0.18762294 43298 75.10740 19.63139 4.355859
## 2 0.45172219 0.08469791 6724 59.72635 11.24331 2.870315
## 3 0.10673071 0.22680275 9669 69.33499 17.03382 4.488572
## 4 0.48691813 3.69733169 19272 75.47219 17.27895 4.197800
## 5 0.08567512 0.10281014 3979 68.86152 14.47600 3.367680
## 6 0.54640215 0.61925577 23444 76.62941 18.90462 3.275891
## poppovertyknown percpovertyknown percbelowpoverty percchildbelowpovert
## 1 63628 96.27478 13.151443 18.01172
## 2 10529 99.08714 32.244278 45.82651
## 3 14235 94.95697 12.068844 14.03606
## 4 30337 98.47757 7.209019 11.17954
## 5 4815 82.50514 13.520249 13.02289
## 6 35107 98.37200 10.399635 14.15882
## percadultpoverty percelderlypoverty inmetro category
## 1 11.009776 12.443812 0 AAR
## 2 27.385647 25.228976 0 LHR
## 3 10.852090 12.697410 0 AAR
## 4 5.536013 6.217047 1 ALU
## 5 11.143211 19.200000 0 AAR
## 6 8.179287 11.008586 0 AAR
# Initial Ggplot
ggplot(midwest, aes(x=area, y=poptotal)) # area and poptotal are columns in 'midwest'
There are no data points because I did not specify whether I want a scatter plot or a line chart; ggplot() on its own only sets up the data and the axes. Any information that is part of the source dataframe has to be specified inside the aes() function.
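One way to confirm why the canvas is blank is to count the layers attached to the plot object. This is a minimal sketch: it uses the copy of midwest bundled with ggplot2 under the local name `mw` (a name chosen only for illustration), and it reads `p$layers`, which is an internal detail of the plot object rather than a documented interface.

```r
library(ggplot2)

# Bundled copy of the data; `mw` is just a local name for this sketch
mw <- ggplot2::midwest

# Only data and aesthetic mappings: no geom, so nothing is drawn
p_blank <- ggplot(mw, aes(x = area, y = poptotal))
n_blank <- length(p_blank$layers)

# Adding a geom attaches one layer to the plot
p_points <- p_blank + geom_point()
n_points <- length(p_points$layers)

cat("layers without a geom:", n_blank, "\n")
cat("layers with geom_point():", n_points, "\n")
```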
Let's make a simple scatterplot by adding a geom layer.
library(ggplot2)
ggplot(midwest, aes(x=area, y=poptotal)) + geom_point()
Each point on the scatterplot represents a county. geom_point() simply means: draw a scatter plot. You can also use geom_jitter(), geom_count(), or geom_bin2d().
To add a line of best fit, you can use geom_smooth() as shown below:
library(ggplot2)
g <- ggplot(midwest, aes(x=area, y=poptotal)) + geom_point() + geom_smooth(method="lm") # set se=FALSE to turn off confidence bands
plot(g)
Here method = "lm" simply means: draw the line of best fit for a linear model.
The X and Y axis limits can be controlled in 2 ways:
1. By excluding the points outside the range. This is done by using xlim() and ylim().
2. By zooming into the region of interest without deleting the outliers. This is done by using coord_cartesian().
library(ggplot2)
g <- ggplot(midwest, aes(x=area, y=poptotal)) + geom_point() + geom_smooth(method="lm") # set se=FALSE to turn off confidence bands
# Delete the points outside the limits
g + xlim(c(0, 0.1)) + ylim(c(0, 1000000)) # deletes points
## Warning: Removed 5 rows containing non-finite values (stat_smooth).
## Warning: Removed 5 rows containing missing values (geom_point).
# g + xlim(0, 0.1) + ylim(0, 1000000) # deletes points
Notice that the line of best fit became more horizontal compared to the original plot. This is because xlim() and ylim() delete the points outside the specified range, so those points are not considered when the line of best fit is drawn (using geom_smooth(method="lm")). This comes in handy when you want to see how the line of best fit changes once some extreme values (or outliers) are removed.
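The effect of dropping out-of-range points can also be checked outside ggplot by fitting the linear model twice. The sketch below is an approximation of what xlim() and ylim() do before geom_smooth(method="lm") runs, not the exact internal procedure; `mw`, `inside` and the `fit_*` names are chosen here for illustration, and the data is the copy bundled with ggplot2.

```r
library(ggplot2)

# Bundled copy of the data; `mw` is just a local name for this sketch
mw <- ggplot2::midwest

# Fit on all counties, comparable to the smooth line on the full data
fit_all <- lm(poptotal ~ area, data = mw)

# Keep only counties inside the limits passed to xlim() and ylim()
inside <- mw$area >= 0 & mw$area <= 0.1 &
  mw$poptotal >= 0 & mw$poptotal <= 1000000
fit_trimmed <- lm(poptotal ~ area, data = mw[inside, ])

n_dropped <- sum(!inside)
slope_all <- unname(coef(fit_all)["area"])
slope_trimmed <- unname(coef(fit_trimmed)["area"])

cat("counties dropped:", n_dropped, "\n")
cat("slope, all counties:", slope_all, "\n")
cat("slope, trimmed data:", slope_trimmed, "\n")
```

Comparing the two slopes shows how much the few very large counties influence the fitted line.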
library(ggplot2)
g <- ggplot(midwest, aes(x=area, y=poptotal)) + geom_point() + geom_smooth(method="lm") # set se=FALSE to turn off confidence bands
# Zoom in without deleting the points outside the limits.
# As a result, the line of best fit is the same as the original plot.
g1 <- g + coord_cartesian(xlim=c(0,0.1), ylim=c(0, 1000000)) # zooms in
plot(g1)
Since all points were considered, the line of best fit did not change.
The plot title and axis labels can be changed in 2 ways as well: either by using the labs() function with title, x and y arguments, or by simply using ggtitle(), xlab() and ylab().
library(ggplot2)
g <- ggplot(midwest, aes(x=area, y=poptotal)) + geom_point() + geom_smooth(method="lm") # set se=FALSE to turn off confidence bands
g1 <- g + coord_cartesian(xlim=c(0,0.1), ylim=c(0, 1000000)) # zooms in
# Add Title and Labels
g1 + labs(title="Area Vs Population", subtitle="From midwest dataset", y="Population", x="Area", caption="Midwest Demographics")
# or
g1 + ggtitle("Area Vs Population", subtitle="From midwest dataset") + xlab("Area") + ylab("Population")
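To check that the two approaches end up with the same labels, the stored labels of both plots can be compared. This is a rough sketch: the `$labels` field is an internal detail of the plot object and may change between ggplot2 versions, and the object names (`base`, `via_labs`, `via_helpers`) are made up for this example.

```r
library(ggplot2)

base <- ggplot(ggplot2::midwest, aes(x = area, y = poptotal)) + geom_point()

# Approach 1: everything through labs()
via_labs <- base + labs(title = "Area Vs Population", x = "Area", y = "Population")

# Approach 2: the individual helper functions
via_helpers <- base + ggtitle("Area Vs Population") + xlab("Area") + ylab("Population")

# Compare the labels stored on the two plot objects
same_title <- identical(via_labs$labels$title, via_helpers$labels$title)
same_x <- identical(via_labs$labels$x, via_helpers$labels$x)
same_y <- identical(via_labs$labels$y, via_helpers$labels$y)

cat("title, x and y match:", same_title && same_x && same_y, "\n")
```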
All the features can be added in a single ggplot() function call as shown below:
# Full Plot call
library(ggplot2)
ggplot(midwest, aes(x=area, y=poptotal)) +
geom_point() +
geom_smooth(method="lm") +
coord_cartesian(xlim=c(0,0.1), ylim=c(0, 1000000)) +
labs(title="Area Vs Population", subtitle="From midwest dataset", y="Population", x="Area", caption="Midwest Demographics")
We can change the aesthetics of a geom layer by modifying the respective geoms. Let’s change the color of the points and the line to a static value.
library(ggplot2)
ggplot(midwest, aes(x=area, y=poptotal)) +
geom_point(col="red", size=3) + # Set static color and size for points
geom_smooth(method="lm", col="firebrick") + # change the color of line
coord_cartesian(xlim=c(0, 0.1), ylim=c(0, 1000000)) +
labs(title="Area Vs Population", subtitle="From midwest dataset", y="Population", x="Area", caption="Midwest Demographics")
If we want the color to change based on another column in the source dataset, it must be specified inside the aes() function.
library(ggplot2)
gg <- ggplot(midwest, aes(x=area, y=poptotal)) +
geom_point(aes(col=state), size=3) + # Set color to vary based on state categories.
geom_smooth(method="lm", col="firebrick", size=2) +
coord_cartesian(xlim=c(0, 0.1), ylim=c(0, 1000000)) +
labs(title="Area Vs Population", subtitle="From midwest dataset", y="Population", x="Area", caption="Midwest Demographics")
plot(gg)
As an added benefit, the legend is added automatically. If needed, it can be removed by setting legend.position to "none" (lowercase) from within a theme() function.
gg + theme(legend.position="none") # remove legend
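The difference between a static and a mapped aesthetic can be seen by inspecting the layer objects themselves. This is a sketch under the assumption that layers expose `aes_params` (static settings) and `mapping` (column mappings); both are internal fields rather than a documented interface, and the variable names are illustrative.

```r
library(ggplot2)

# Static colour: set outside aes(), same value for every point, no legend
static_layer <- geom_point(col = "red", size = 3)

# Mapped colour: set inside aes(), varies by the 'state' column, legend is drawn
mapped_layer <- geom_point(aes(col = state), size = 3)

static_params <- names(static_layer$aes_params)       # fixed settings
mapped_aes <- names(mapped_layer$mapping)             # column mappings
mapped_params <- names(mapped_layer$aes_params)       # fixed settings of the mapped layer

cat("static layer, fixed settings:", static_params, "\n")
cat("mapped layer, mapped aesthetics:", mapped_aes, "\n")
cat("mapped layer, fixed settings:", mapped_params, "\n")
```

Note that ggplot2 stores the `col` shorthand under the name `colour` in both cases.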
gg + scale_colour_brewer(palette = "Set1") # change color palette
More such palettes can be found in the RColorBrewer package:
library(RColorBrewer)
head(brewer.pal.info, 10) # show 10 palettes
## maxcolors category colorblind
## BrBG 11 div TRUE
## PiYG 11 div TRUE
## PRGn 11 div TRUE
## PuOr 11 div TRUE
## RdBu 11 div TRUE
## RdGy 11 div FALSE
## RdYlBu 11 div TRUE
## RdYlGn 11 div FALSE
## Spectral 11 div FALSE
## Accent 8 qual FALSE
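Since state is an unordered category, a qualitative palette is a natural choice. The following sketch lists the qualitative palettes and pulls five colours from Set1, one for each of the five states in midwest; the variable names are only illustrative.

```r
library(RColorBrewer)

# Names of the qualitative palettes, which suit unordered categories such as state
qual_palettes <- rownames(brewer.pal.info[brewer.pal.info$category == "qual", ])

# Five colours from Set1, returned as hex strings
set1_five <- brewer.pal(5, "Set1")

print(qual_palettes)
print(set1_five)
```

Any of the listed names can be passed as the palette argument of scale_colour_brewer().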
Let’s see how to change the X and Y axis text and its location. This can be done in 2 steps.
Step 1: Set the breaks
The breaks should be on the same scale as the X axis variable. Note that I am using scale_x_continuous() because the X axis variable is continuous. Had it been a date variable, scale_x_date() could be used. Like scale_x_continuous(), an equivalent scale_y_continuous() is available for the Y axis.
library(ggplot2)
# Base plot
gg <- ggplot(midwest, aes(x=area, y=poptotal)) +
geom_point(aes(col=state), size=3) + # Set color to vary based on state categories.
geom_smooth(method="lm", col="firebrick", size=2) +
coord_cartesian(xlim=c(0, 0.1), ylim=c(0, 1000000)) +
labs(title="Area Vs Population", subtitle="From midwest dataset", y="Population", x="Area", caption="Midwest Demographics")
# Change breaks
gg + scale_x_continuous(breaks=seq(0, 0.1, 0.01))
Step 2: Change the labels
You can optionally change the labels at the axis ticks. labels takes a vector of the same length as breaks. Let me demonstrate by setting the labels to the letters a to k (though they carry no meaning in this context).
library(ggplot2)
# Base Plot
gg <- ggplot(midwest, aes(x=area, y=poptotal)) +
geom_point(aes(col=state), size=3) + # Set color to vary based on state categories.
geom_smooth(method="lm", col="firebrick", size=2) +
coord_cartesian(xlim=c(0, 0.1), ylim=c(0, 1000000)) +
labs(title="Area Vs Population", subtitle="From midwest dataset", y="Population", x="Area", caption="Midwest Demographics")
# Change breaks + label
gg + scale_x_continuous(breaks=seq(0, 0.1, 0.01), labels = letters[1:11])
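Because labels must match breaks in length, it can help to derive one from the other instead of hard-coding both. A small sketch, with `x_breaks` and `x_labels` as names made up for this example:

```r
library(ggplot2)

# Define the breaks once
x_breaks <- seq(0, 0.1, 0.01)

# Derive the labels from the breaks so the lengths always agree
x_labels <- letters[seq_along(x_breaks)]

# Guard against a mismatch before handing both to the scale
stopifnot(length(x_breaks) == length(x_labels))

p <- ggplot(ggplot2::midwest, aes(x = area, y = poptotal)) +
  geom_point() +
  scale_x_continuous(breaks = x_breaks, labels = x_labels)

cat("number of breaks:", length(x_breaks), "\n")
```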
If you need to reverse the scale, use scale_x_reverse().
gg <- ggplot(midwest, aes(x=area, y=poptotal)) +
geom_point(aes(col=state), size=3) + # Set color to vary based on state categories.
geom_smooth(method="lm", col="firebrick", size=2) +
coord_cartesian(xlim=c(0, 0.1), ylim=c(0, 1000000)) +
labs(title="Area Vs Population", subtitle="From midwest dataset", y="Population", x="Area", caption="Midwest Demographics")
# Reverse X Axis Scale
gg + scale_x_reverse()
To write customized text for the axis labels, I have used 2 methods:
Method 1: Using sprintf(). (Formatted with a % sign in the example below.)
Method 2: Using a custom user-defined function. (Formatted 1000s to the 1K scale.)
Use whichever method feels convenient.
# Base Plot
gg <- ggplot(midwest, aes(x=area, y=poptotal)) +
geom_point(aes(col=state), size=3) + # Set color to vary based on state categories.
geom_smooth(method="lm", col="firebrick", size=2) +
coord_cartesian(xlim=c(0, 0.1), ylim=c(0, 1000000)) +
labs(title="Area Vs Population", subtitle="From midwest dataset", y="Population", x="Area", caption="Midwest Demographics")
# Change Axis Texts
gg + scale_x_continuous(breaks=seq(0, 0.1, 0.01), labels = sprintf("%1.2f%%", seq(0, 0.1, 0.01))) +
scale_y_continuous(breaks=seq(0, 1000000, 200000), labels = function(x){paste0(x/1000, 'K')})
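It can be useful to look at the label strings on their own before attaching them to a scale. The sketch below reproduces both methods on the same breaks; `to_thousands` is a hypothetical helper name standing in for the anonymous function used above.

```r
x_breaks <- seq(0, 0.1, 0.01)
y_breaks <- seq(0, 1000000, 200000)

# Method 1: sprintf() appends a "%" sign; it does not rescale the values
x_labels <- sprintf("%1.2f%%", x_breaks)

# Method 2: a small helper (hypothetical name) that rescales to thousands
to_thousands <- function(x) paste0(x / 1000, "K")
y_labels <- to_thousands(y_breaks)

print(head(x_labels, 3))  # "0.00%" "0.01%" "0.02%"
print(y_labels)           # "0K" "200K" "400K" "600K" "800K" "1000K"
```

If the X values were meant to be read as true percentages, they would have to be multiplied by 100 before formatting; the example above only demonstrates the formatting.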
To customize the entire theme in one shot using pre-built themes, I have used 2 different methods:
Method 1: Use theme_set() to set the theme before drawing the ggplot. Note that this setting will affect all future plots.
Method 2: Draw the ggplot and then add the overall theme setting (e.g. theme_bw()).
# Base plot
gg <- ggplot(midwest, aes(x=area, y=poptotal)) +
geom_point(aes(col=state), size=3) + # Set color to vary based on state categories.
geom_smooth(method="lm", col="firebrick", size=2) +
coord_cartesian(xlim=c(0, 0.1), ylim=c(0, 1000000)) +
labs(title="Area Vs Population", subtitle="From midwest dataset", y="Population", x="Area", caption="Midwest Demographics")
gg <- gg + scale_x_continuous(breaks=seq(0, 0.1, 0.01))
# method 1: Using theme_set()
theme_set(theme_classic()) # affects every plot drawn after this call
gg
# method 2: Adding theme Layer itself.
gg + theme_bw() + labs(subtitle="BW Theme")
gg + theme_classic() + labs(subtitle="Classic Theme")
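Since theme_set() changes the default for all later plots, it can be worth keeping the previous theme so it can be restored. theme_set() returns the previously active theme invisibly; the sketch below relies on that, with `old_theme` and `p` as illustrative names.

```r
library(ggplot2)

# theme_set() returns the previously active theme, so keep it for later
old_theme <- theme_set(theme_classic())

# Any plot drawn now uses theme_classic() by default
p <- ggplot(ggplot2::midwest, aes(x = area, y = poptotal)) + geom_point()

# Put the previous default back so later plots are unaffected
theme_set(old_theme)
restored <- identical(theme_get(), old_theme)

cat("previous theme restored:", restored, "\n")
```

Method 2 avoids this bookkeeping entirely, because the theme is attached to a single plot only.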
More to come: creating advanced visualizations, including modifying theme components, manipulating legends, annotations, faceting and custom layouts.