In this session we will extend the simple visualizations you produced in earlier sessions with core (base) R plotting functions to the ggplot2 package, a dedicated visualization package based on the Grammar of Graphics (Wilkinson, 2005), hence the gg in the package name. The Grammar of Graphics conceptualizes graphics (and plots) in terms of their theoretical components. The approach is to handle each element of the graphic separately in a series of layers, and in so doing to control each part of the plot. This differs from the base plot functions, which apply specific plotting methods based on the type or class of the data passed to them.
library(ggplot2)
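To make the idea of layers concrete, the sketch below builds one plot step by step. It uses the built-in mtcars dataset, which is an assumption for illustration only and is not the dataset used later in this session. Each + adds one component: data and aesthetic mappings, then a geometry, then labels and a theme.

```r
library(ggplot2)

# Step 1: data and aesthetic mappings only (draws an empty panel with axes)
p <- ggplot(data = mtcars, mapping = aes(x = wt, y = mpg))

# Step 2: add a geometry layer that draws the points
p <- p + geom_point(colour = "darkred")

# Step 3: add labels and a theme, each as a separate component
p <- p +
  labs(x = "Weight (1000 lbs)", y = "Miles per gallon") +
  theme_bw()

p  # printing the object draws the plot
```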
Plots can be created using either the qplot or ggplot functions in the ggplot2 package. The function qplot is used to produce quick simple plots in a similar way to the plot function. It takes x and y as arguments, along with an optional data argument for a data.frame containing x and y. The figure below is created by defining a vector holding a sequence from 0 to 2 pi (x), its sine (y), and a second vector that adds a small random error term to y (y2).
Define data
x <- seq(0, 2*pi, length.out = 100)
y <- sin(x)
y2 <- y + rnorm(100,0,0.1)
These can then be plotted using qplot:
qplot(x,y2,col=I('darkred')) +
geom_line(aes(x,y), col="darkblue", linewidth = 1.5) # use size = 1.5 on ggplot2 < 3.4.0
Notice how the plot type is first specified (in this
case qplot()) and then subsequent lines include
instructions for what to plot and how to plot it.
Here geom_line() was specified, followed by some style
instructions.
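qplot() has been deprecated in recent ggplot2 releases (from 3.4.0, as far as we are aware), so it may be worth knowing the ggplot() equivalent. The sketch below is one way to write the same figure. The data frame name df and the set.seed() call are our additions, the latter so that the random noise is reproducible.

```r
library(ggplot2)

set.seed(1)                               # assumption: fixed seed for reproducible noise
x  <- seq(0, 2 * pi, length.out = 100)
y  <- sin(x)
y2 <- y + rnorm(100, mean = 0, sd = 0.1)
df <- data.frame(x = x, y = y, y2 = y2)   # ggplot() expects a data frame

p <- ggplot(df, aes(x = x)) +
  geom_point(aes(y = y2), colour = "darkred") +   # noisy observations
  geom_line(aes(y = y), colour = "darkblue")      # underlying sine curve
p
```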
Try adding
theme_bw()
or
theme_dark()
to the above. Remember that you need to include a + for
each additional element in ggplot.
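qplot(x,y2,col=I('darkred')) +
geom_line(aes(x,y), col="darkgreen", linewidth = 1.5) + # use size = 1.5 on ggplot2 < 3.4.0
theme_bw() + # complete theme first, so it does not reset the tweak below
theme(axis.text=element_text(size=20))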
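One detail worth checking when combining themes: a complete theme such as theme_bw() replaces any theme() settings added before it, so individual tweaks normally go after the complete theme. A minimal sketch, using the built-in mtcars data as an assumption for illustration:

```r
library(ggplot2)

base <- ggplot(mtcars, aes(wt, mpg)) + geom_point()

# Tweak first, complete theme second: theme_bw() resets the axis text size
p_lost <- base + theme(axis.text = element_text(size = 20)) + theme_bw()

# Complete theme first, tweak second: the larger axis text is kept
p_kept <- base + theme_bw() + theme(axis.text = element_text(size = 20))
```

Printing p_lost and p_kept side by side should show the difference in the axis text.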
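The complete themes built into ggplot2 include theme_grey() (the default), theme_bw(), theme_linedraw(), theme_light(), theme_dark(), theme_minimal(), theme_classic() and theme_void().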
Now we will focus on developing different kinds of plots for different kinds of variables:
scatterplots
histograms
boxplots
plots of counts
bar plots
Let's get some data for practice
link="https://raw.githubusercontent.com/bijayprad/data/refs/heads/main/CS_Bank.csv"
data=read.csv(link)
head(data,2) # show the first two rows
Change the variable names
names(data) <- c("RN", "score", "sex","ms","income","age","edu")
head(data,2)
Change the variable levels
data$sex=as.factor(data$sex)
levels(data$sex) <- list("male" = 1, "female"=2)
data$ms=as.factor(data$ms)
levels(data$ms) <- list("married" = 1, "unmarried"=2)
The income variable can be categorized as follows
data$IncClass <- rep("Medium", nrow(data))
data$IncClass[data$income >= 20000] = "High"
data$IncClass[data$income <= 12800] = "Low"
This should be converted to an ordered factor
data$IncClass <- factor(data$IncClass,
levels=c("Low","Medium","High"),
ordered=TRUE)
The distribution across the classes can be checked
table(data$IncClass)
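The same classification can be written in one step with nested ifelse() calls. The sketch below uses a small made-up income vector (hypothetical values, not taken from the CS_Bank data) so that the boundary behaviour is easy to check: values of 12800 or less are Low and values of 20000 or more are High.

```r
# Hypothetical incomes chosen to sit on and around the two cut-offs
income <- c(9000, 12800, 12801, 19999, 20000, 35000)

inc_class <- ifelse(income <= 12800, "Low",
             ifelse(income >= 20000, "High", "Medium"))

# Ordered factor so that Low < Medium < High in tables and plot legends
inc_class <- factor(inc_class,
                    levels = c("Low", "Medium", "High"),
                    ordered = TRUE)

table(inc_class)
```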
Consider a basic bar chart, as drawn with geom_bar(). The following chart displays the number of observations in the data dataset, grouped by IncClass.
ggplot(data = data) +
geom_bar(mapping = aes(x = IncClass))
On the x-axis the chart displays the IncClass variable, and the y-axis automatically displays the count. Some plots, like scatterplots, show the raw values of your dataset. Others, like bar charts, calculate new values to plot:
bar charts, histograms, and frequency polygons bin your data and then plot bin counts, the number of points that fall in each bin.
The algorithm used to calculate new values for a graph is called a stat, short for statistical transformation.
You can learn which stat a geom uses by inspecting the default value for the stat argument. For example, ?geom_bar shows that the default value for stat is “count”, which means that geom_bar() uses stat_count().
You can generally use geoms and stats interchangeably. For example, you can recreate the previous plot using stat_count() instead of geom_bar():
ggplot(data = data) +
stat_count(mapping = aes(x = IncClass))
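To see what the stat actually computes, the counts behind a bar chart can be compared with table(). This sketch uses a small made-up factor (hypothetical data) and layer_data(), which returns the data frame ggplot2 computed for a layer. It also shows geom_col(), which plots counts that have already been tabulated.

```r
library(ggplot2)

# Hypothetical grouping variable with known counts: 2 Low, 3 Medium, 1 High
toy <- data.frame(
  grp = factor(c("Low", "Low", "Medium", "Medium", "Medium", "High"),
               levels = c("Low", "Medium", "High"))
)

p <- ggplot(toy) + geom_bar(aes(x = grp))

bar_counts <- layer_data(p)$count        # values computed by stat_count()
tab_counts <- as.vector(table(toy$grp))  # the same counts from base R

# If the counts are already tabulated, geom_col() plots them as given
tab_df <- as.data.frame(table(grp = toy$grp))
p2 <- ggplot(tab_df) + geom_col(aes(x = grp, y = Freq))
```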
Scatterplots show two variables together, so we can examine pairs of variables in the data object. For example, consider income and age. The points can also be coloured by sex (male or female), as in the second plot below.
ggplot(data = data, mapping = aes(x = income, y = age)) +
geom_point()
ggplot(data = data, mapping = aes(x = income, y = age ,colour=sex)) +
geom_point()
Now try using the IncClass variable created earlier as the group. What happens? What do you see? Does this make sense? Are there any trends?
This might be confirmed by adding a trend line:
ggplot(data = data, mapping = aes(x = income, y = age)) +
geom_point() +
geom_smooth()
By default the type of trend line is automatically selected. Try adjusting the code above and specifying the trend line as below:
geom_smooth(method = "lm")
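With method = "lm" the trend line is an ordinary least-squares fit of y on x, so it can be checked against lm(). A sketch on simulated data (hypothetical values with a known slope of 2):

```r
library(ggplot2)

set.seed(42)
toy <- data.frame(x = 1:50)
toy$y <- 2 * toy$x + rnorm(50, sd = 1)   # true slope 2, small noise

p <- ggplot(toy, aes(x = x, y = y)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE)

fit <- lm(y ~ x, data = toy)   # the same straight-line model, fitted directly
coef(fit)                      # intercept and slope of the trend line
```

Setting se = FALSE hides the confidence band; leave it out to keep the default shaded band.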
Also note that style templates can be added and colours changed:
ggplot(data = data, mapping = aes(x = income, y = age)) +
geom_point() +
geom_smooth(method = "lm", col = "red", fill = "lightsalmon") +
theme_dark()
As with all plots the axis labels and title can be specified
ggplot(data = data, mapping = aes(x = income, y = age)) +
geom_point(shape = 23, fill = "red") +
geom_smooth(method = "lm") +
theme_bw() +
xlab("income of respondent") +
ylab("age of respondent") +
ggtitle("My title", subtitle = "my subtitle")
We can use histograms to examine the distribution of income.
ggplot(data, aes(x=income)) +
geom_histogram(binwidth = 5000, colour = "red", fill = "blue")
We can add a density plot
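The binwidth argument sets how wide each bin is, and where the bin edges fall can be controlled with boundary. The sketch below uses six made-up income values (hypothetical) and boundary = 0, so the bins run from 0 to 5000, 5000 to 10000 and so on. It then reads the bin counts back with layer_data().

```r
library(ggplot2)

# Hypothetical incomes: two below 5000, three between 5000 and 10000, one above
toy <- data.frame(income = c(1000, 3000, 6000, 7000, 8000, 12000))

p <- ggplot(toy, aes(x = income)) +
  geom_histogram(binwidth = 5000, boundary = 0,
                 colour = "red", fill = "blue")

bins <- layer_data(p)[, c("xmin", "xmax", "count")]
bins   # one row per bin: lower edge, upper edge, number of points
```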
ggplot(data, aes(x=income)) +
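geom_histogram(aes(y=after_stat(density)),binwidth=5000,colour="white")+ # after_stat() needs ggplot2 >= 3.3.0; older code used ..density..
geom_density(alpha=.4, fill="green") +
geom_vline(aes(xintercept=median(income)),color="red",linetype="dashed", linewidth=1) # use size=1 on ggplot2 < 3.4.0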
We can also plot histograms of multiple groups of data values together on the same plot, using the fill option, to compare distributions in a single graphic. Obviously we have to be careful about which variables we plot together for this to make sense.
ggplot(data=data,aes(x=age,fill=IncClass)) +
geom_histogram(binwidth = 10, colour="white")
Multiple histogram plots can be generated using the faceting functions in ggplot2, such as facet_grid(). These create a separate panel for each group. Here the histogram of income is split by income class so that the three groups can be compared
ggplot(data, aes(x=income, fill=IncClass)) +
geom_histogram(color="black",binwidth = 1000) +
scale_fill_manual("Income Class",values = c("orange", "yellow","red")) +
facet_grid(IncClass~.)
Similarly, we can compare the income distributions of the two sexes
ggplot(data, aes(x=income, fill=sex)) +
geom_histogram(color="black",binwidth = 1000)+
scale_fill_manual("sex",values = c("orange", "yellow")) +
facet_grid(sex~.)
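facet_grid(sex ~ .) stacks one panel per group in rows. facet_wrap() is an alternative that can give each panel its own y-axis scale, which may help when group sizes differ a lot. A sketch with simulated data (hypothetical values, with deliberately unequal group sizes):

```r
library(ggplot2)

set.seed(7)
toy <- data.frame(
  income = c(rnorm(200, mean = 15000, sd = 3000),
             rnorm(50,  mean = 18000, sd = 3000)),
  sex    = factor(rep(c("male", "female"), times = c(200, 50)))
)

p <- ggplot(toy, aes(x = income, fill = sex)) +
  geom_histogram(colour = "black", binwidth = 1000) +
  facet_wrap(~ sex, ncol = 1, scales = "free_y")   # one panel per sex, separate y scales
p
```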
Perhaps a better way of examining distributions is through boxplots. Boxplots display the distribution of a continuous variable.
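Each boxplot consists of:
A box that stretches from the 25th to the 75th percentile of the distribution; its length is the interquartile range (IQR).
A line inside the box that displays the median, i.e. the 50th percentile of the distribution.
A line (or whisker) that extends from each end of the box to the farthest non-outlier point in the distribution.
Points for observations that lie more than 1.5 times the IQR beyond either end of the box; these are treated as outliers.
The position of the median line within the box and the relative lengths of the whiskers tell you about the spread and skew of the distribution.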
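The pieces listed above can be computed by hand with base R. The sketch below uses eight made-up values (hypothetical) with one obvious outlier. It follows the usual 1.5 times IQR rule, which we believe matches what geom_boxplot() draws by default.

```r
v <- c(10, 12, 13, 14, 15, 16, 18, 50)   # hypothetical values; 50 is far from the rest

q     <- quantile(v, c(0.25, 0.5, 0.75)) # box edges and median
iqr   <- q[[3]] - q[[1]]                 # length of the box
lower <- q[[1]] - 1.5 * iqr              # fences: limits for non-outliers
upper <- q[[3]] + 1.5 * iqr

outliers <- v[v < lower | v > upper]            # drawn as individual points
whiskers <- range(v[v >= lower & v <= upper])   # farthest non-outlier values

q
outliers
whiskers
```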
First, let's create some boxplots, using similar syntax to the other plot types above.
ggplot(data, aes(y = income)) +
geom_boxplot()
And we can extend boxplots with some grouping as before:
ggplot(data, aes(y = income, fill = ms)) +
geom_boxplot() +
scale_fill_manual("ms",values = c("orange", "red"))
It is also possible to extend the grouping to compare more than one grouping variable:
ggplot(data, aes(sex,income, fill = ms)) +
geom_boxplot() +
scale_fill_manual("ms",values = c("orange", "red"))
And of course the same information could be grouped and displayed using a different organisation:
ggplot(data, aes(ms,age, fill = IncClass)) +
geom_boxplot() +
scale_fill_manual("IncClass",values = c("orange", "red","blue"))
We can also use facets here
ggplot(data, aes(ms,age, fill = IncClass)) +
geom_boxplot() +
scale_fill_manual("IncClass",values = c("orange", "red","blue"))+
facet_grid(IncClass~.)