1 Introduction

In this session we will extend the simple visualizations you made in earlier sessions with core (base) R plotting functions to the ggplot2 package, a dedicated visualization package based on the Grammar of Graphics (Wilkinson, 2005) (hence the gg in the package name). The Grammar of Graphics conceptualizes graphics (and plots) in terms of their theoretical components. The approach is to handle each element of the graphic separately in a series of layers, and in so doing to control each part of the plot. This differs from the base plot functions, which apply specific plotting methods based on the type or class of the data passed to them.

2 Loading Required Libraries

library(ggplot2)

3 Quick plots

Plots can be created using either the qplot or ggplot functions in the ggplot2 package. The function qplot is used to produce quick, simple plots in a similar way to the base plot function. It takes x and y arguments and, optionally, a data argument for a data.frame containing x and y. The figure below is created by defining a vector holding a sequence from 0 to 2 pi (x), its sine (y) and a second y vector with a small random error term added (y2).

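Define the data:

x <- seq(0, 2 * pi, length.out = 100)  # 100 evenly spaced values from 0 to 2*pi
y <- sin(x)                            # the exact sine curve
y2 <- y + rnorm(100, 0, 0.1)           # sine plus normal noise (mean 0, sd 0.1)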

These can then be plotted using qplot:

qplot(x,y2,col=I('darkred')) + 
  geom_line(aes(x,y), col="darkblue", size = 1.5)

Notice how the plot type is specified first (in this case qplot()) and the subsequent lines give instructions for what to plot and how to plot it. Here geom_line() was specified, followed by some style instructions.
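The same figure can also be built with ggplot() directly, which makes the layering more explicit. The sketch below is one possible equivalent, not the only way to write it: the data frame name wave and the call to set.seed() are our own additions for illustration. Note also that recent ggplot2 releases (3.4.0 onwards, to the best of our knowledge) mark qplot() as deprecated, so the ggplot() form is the safer habit.

```r
library(ggplot2)

set.seed(1)  # assumption: seed added only so the noise is reproducible
x  <- seq(0, 2 * pi, length.out = 100)
y  <- sin(x)
y2 <- y + rnorm(100, 0, 0.1)

# Hypothetical helper data frame: ggplot() expects its data in this form
wave <- data.frame(x = x, y = y, y2 = y2)

p <- ggplot(wave, aes(x = x)) +
  geom_point(aes(y = y2), colour = "darkred") +  # layer 1: noisy points
  geom_line(aes(y = y), colour = "darkblue")     # layer 2: exact sine curve
p
```

Each geom_*() call adds one layer, and each layer can have its own aesthetic mapping.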

Try adding

  theme_bw()

or

  theme_dark()

to the above. Remember that you need to include a + for each additional element in ggplot.

qplot(x,y2,col=I('darkred')) +
  geom_line(aes(x,y), col="darkgreen", size = 1.5) +
  theme_bw() +                                  # complete theme first
  theme(axis.text=element_text(size=20))        # then tweak individual elements

The complete themes available in ggplot2 are:

  1. theme_gray() – Default ggplot2 theme
  2. theme_bw() – Black and white theme
  3. theme_linedraw() – Uses only black lines on white background
  4. theme_light() – Similar to bw() but with lighter grid lines
  5. theme_dark() – Dark background, good for presentations
  6. theme_minimal() – Very clean, minimal theme
  7. theme_classic() – Clean and classic look with axes
  8. theme_void() – Removes all background, axes, and labels
  9. theme_test() – Simple theme useful for testing
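The order in which a complete theme and a theme() tweak are added matters, because a complete theme such as theme_minimal() replaces all theme settings made before it. The following sketch illustrates this on a small made-up data frame (df and the plot names are hypothetical, chosen only for this example).

```r
library(ggplot2)

# Toy data, made up for illustration
df <- data.frame(x = 1:10, y = (1:10)^2)

base_plot <- ggplot(df, aes(x, y)) + geom_point()

# Complete theme first, then the element-level tweak: large axis text is kept
p_kept <- base_plot +
  theme_minimal() +
  theme(axis.text = element_text(size = 20))

# Reversed order: theme_minimal() overrides the earlier axis.text setting
p_lost <- base_plot +
  theme(axis.text = element_text(size = 20)) +
  theme_minimal()

p_kept
p_lost
```

Comparing the two plots side by side should show large axis labels only in the first.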

Now we will focus on developing different kinds of plots for different kinds of variables:

  • scatterplots

  • histograms

  • boxplots

  • plots of counts

  • bar plots

Let's load some data for practice:

link <- "https://raw.githubusercontent.com/bijayprad/data/refs/heads/main/CS_Bank.csv"
data <- read.csv(link)
head(data,2)  # show the first two rows

Rename the variables:

names(data) <- c("RN", "score", "sex","ms","income","age","edu")
head(data,2)

Convert sex and marital status (ms) to factors and relabel their levels:

data$sex=as.factor(data$sex)
levels(data$sex) <- list("male" = 1, "female"=2)
data$ms=as.factor(data$ms)
levels(data$ms) <- list("married" = 1, "unmarried"=2)

Income can be grouped into classes as follows:

data$IncClass <- rep("Medium", nrow(data)) 
data$IncClass[data$income >=  20000] = "High"
data$IncClass[data$income <=  12800] = "Low"

This should then be converted to an ordered factor:

data$IncClass <- factor(data$IncClass,
                      levels=c("Low","Medium","High"), 
                      ordered=TRUE)

The distribution across classes can be checked with table():

table(data$IncClass)
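To see how the thresholds behave at the boundaries, here is the same recoding applied to a small made-up vector of incomes. The data frame toy and its values are hypothetical and serve only to illustrate the logic; the thresholds are those used above.

```r
# Made-up incomes, including the two boundary values 12800 and 20000
toy <- data.frame(income = c(10000, 12800, 15000, 20000, 25000))

toy$IncClass <- rep("Medium", nrow(toy))       # default class
toy$IncClass[toy$income >= 20000] <- "High"    # 20000 itself counts as High
toy$IncClass[toy$income <= 12800] <- "Low"     # 12800 itself counts as Low

toy$IncClass <- factor(toy$IncClass,
                       levels = c("Low", "Medium", "High"),
                       ordered = TRUE)

table(toy$IncClass)
# Low: 2, Medium: 1, High: 2
```

Because the factor is ordered, comparisons such as toy$IncClass > "Low" are also meaningful.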

Consider a basic bar chart, as drawn with geom_bar(). The following chart displays the number of observations in the data dataset, grouped by IncClass.

ggplot(data = data) + 
  geom_bar(mapping = aes(x = IncClass))

On the x-axis, the chart displays the IncClass variable, and the y-axis automatically displays the count, even though count is not a variable in the dataset. Many graphs, like scatterplots, plot the raw values of your dataset. Others, like bar charts, calculate new values to plot:

Bar charts, histograms, and frequency polygons bin your data and then plot bin counts, the number of points that fall in each bin.

The algorithm used to calculate new values for a graph is called a stat, short for statistical transformation.

You can learn which stat a geom uses by inspecting the default value for the stat argument. For example, ?geom_bar shows that the default value for stat is “count”, which means that geom_bar() uses stat_count().

You can generally use geoms and stats interchangeably. For example, you can recreate the previous plot using stat_count() instead of geom_bar():

ggplot(data = data) + 
  stat_count(mapping = aes(x = IncClass))
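One way to see what the stat has computed is to pull the transformed data out of the plot with layer_data(). The sketch below uses a small made-up data frame (toy is hypothetical) and also shows the opposite route: computing the counts yourself and plotting them as they are with geom_col().

```r
library(ggplot2)

# Made-up income classes for illustration
toy <- data.frame(
  IncClass = factor(c("Low", "Low", "Medium", "High", "High", "High"),
                    levels = c("Low", "Medium", "High"))
)

p <- ggplot(toy) +
  geom_bar(aes(x = IncClass))

computed <- layer_data(p)   # the data after stat_count() has run
computed$count              # 2 1 3

# Pre-computed counts: geom_col() plots the y values without transforming them
counts <- as.data.frame(table(IncClass = toy$IncClass))
p_col <- ggplot(counts, aes(x = IncClass, y = Freq)) +
  geom_col()
p_col
```

Both plots should look the same; the difference is only in who does the counting.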

4 Scatterplots

Scatterplots show two variables together, so we can examine pairs of variables in the data object. For example, consider income and age, first on their own and then with the points coloured by sex (male or female).

ggplot(data = data, mapping = aes(x = income, y = age)) + 
  geom_point()
ggplot(data = data, mapping =  aes(x = income, y = age ,colour=sex)) + 
  geom_point()

Now try using the IncClass variable created earlier as the grouping (colour) variable. What happens? What do you see? Does this make sense? Are there any trends?

This might be confirmed by adding a trend line:

ggplot(data = data, mapping = aes(x = income, y = age)) + 
  geom_point() +
  geom_smooth()

By default the type of trend line is automatically selected. Try adjusting the code above and specifying the trend line as below:

geom_smooth(method = "lm")
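With method = "lm" the smoother draws an ordinary least-squares line, so it should agree with a model fitted by lm(). The following sketch checks this on simulated data; the data frame toy, the seed and the simulated relationship are all made up for illustration and say nothing about the real dataset.

```r
library(ggplot2)

set.seed(42)  # assumption: seed chosen arbitrarily for reproducibility
toy <- data.frame(income = runif(50, 10000, 30000))
toy$age <- 25 + toy$income / 1000 + rnorm(50, sd = 3)  # simulated linear trend

p <- ggplot(toy, aes(x = income, y = age)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE)  # se = FALSE hides the band
p

# Fit the same model by hand and compare with the line ggplot2 drew
fit <- lm(age ~ income, data = toy)
trend <- layer_data(p, 2)   # layer 2 is the smoother
all.equal(trend$y, unname(predict(fit, data.frame(income = trend$x))))
```

Setting se = TRUE (the default) adds a confidence band around the line.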

Also note that style templates can be added and colours changed:

ggplot(data = data, mapping = aes(x = income, y = age)) + 
  geom_point() + 
  geom_smooth(method = "lm", col = "red", fill = "lightsalmon") + 
  theme_dark()

As with all plots, the axis labels and title can be specified:

ggplot(data = data, mapping = aes(x = income, y = age)) + 
  geom_point(shape = 23, fill = "red") +
  geom_smooth(method = "lm") + 
  theme_bw() +
  xlab("income of respondent") +
  ylab("age of respondent") +
  ggtitle("My title", subtitle = "my subtitle")

5 Histograms

We can use histograms to examine the distribution of income.

ggplot(data, aes(x=income)) + 
  geom_histogram(binwidth = 5000, colour = "red", fill = "blue")

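We can overlay a density curve and mark the median with a dashed line. For the two to share a scale, the histogram must show densities rather than counts:

ggplot(data, aes(x=income)) + 
  geom_histogram(aes(y=after_stat(density)),binwidth=5000,colour="white")+  # after_stat() replaces the older ..density.. notation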
  geom_density(alpha=.4, fill="green") +
  geom_vline(aes(xintercept=median(income)),color="red",linetype="dashed", size=1)
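Plotting densities instead of counts rescales the bars so that their total area is 1, which is what puts the histogram on the same scale as the density curve. The sketch below checks this on simulated incomes; toy and its log-normal parameters are made up for illustration, and after_stat() assumes ggplot2 3.3.0 or later.

```r
library(ggplot2)

set.seed(1)  # assumption: seed chosen arbitrarily for reproducibility
toy <- data.frame(income = rlnorm(200, meanlog = log(15000), sdlog = 0.4))

p <- ggplot(toy, aes(x = income)) +
  geom_histogram(aes(y = after_stat(density)),
                 binwidth = 5000, colour = "white") +
  geom_density(alpha = 0.4, fill = "green")
p

# Bar height times bar width, summed over all bars, gives the total area
bars <- layer_data(p, 1)
total_area <- sum(bars$density * (bars$xmax - bars$xmin))
total_area
```

The reported total area should be 1 up to rounding error, whatever binwidth is chosen.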

We can also plot histograms of several groups together on the same plot, using the fill option, to compare their distributions in a single graphic. Obviously we have to be careful about which variables we plot together for this to make sense.

ggplot(data=data,aes(x=age,fill=IncClass)) + 
  geom_histogram(binwidth = 10, colour="white") 

Multiple histogram plots can be generated using the faceting functions in ggplot2, such as facet_grid(). These create a separate panel for each group. Here income is plotted in its own panel for each income class so that the classes can be compared:

ggplot(data, aes(x=income, fill=IncClass)) +
  geom_histogram(color="black",binwidth = 1000) +
  scale_fill_manual("Income Class",values = c("orange", "yellow","red")) +
  facet_grid(IncClass~.)

Similarly, we can compare the income distributions of the two sexes:

ggplot(data, aes(x=income, fill=sex)) +
  geom_histogram(color="black",binwidth = 1000)+
  scale_fill_manual("sex",values = c("orange", "yellow")) +
  facet_grid(sex~.)
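facet_grid(sex ~ .) stacks one row of panels per level of sex. facet_wrap() is an alternative that also lets each panel have its own y-axis through scales = "free_y", which can help when the groups differ a lot in size. The sketch below uses simulated data; toy and its parameters are made up for illustration.

```r
library(ggplot2)

set.seed(2)  # assumption: seed chosen arbitrarily for reproducibility
toy <- data.frame(
  income = c(rnorm(100, 15000, 3000), rnorm(100, 18000, 4000)),
  sex    = factor(rep(c("male", "female"), each = 100),
                  levels = c("male", "female"))
)

base_hist <- ggplot(toy, aes(x = income, fill = sex)) +
  geom_histogram(colour = "black", binwidth = 1000)

# One row per sex, shared axes
p_grid <- base_hist + facet_grid(sex ~ .)

# One panel per sex in a single column, each with its own y-axis
p_wrap <- base_hist + facet_wrap(~ sex, ncol = 1, scales = "free_y")

p_grid
p_wrap
```

Shared axes make heights directly comparable across panels; free axes make the shape of each distribution easier to read.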

6 Boxplots

Perhaps a better way of examining distributions is through boxplots. Boxplots display the distribution of a continuous variable.

Each boxplot consists of:

  • A box that stretches from the 25th to 75th percentile of the distribution (IQR).

  • A line that displays the median, i.e. 50th percentile of the distribution

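  • The relative lengths of the two halves of the box and of the lines give an indication of the spread and skew of the distribution

  • Points for observations that lie more than 1.5 times the IQR beyond either edge of the box (outliers)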

  • A line (or whisker) that extends from each end of the box and goes to the farthest non-outlier point in the distribution.
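The components listed above can be computed by hand with base R. The following sketch does so for a small made-up vector (the values are hypothetical), using quantile() and IQR() with their default settings.

```r
# Made-up values with one obvious outlier
x <- c(10, 12, 13, 15, 16, 18, 20, 22, 60)

q   <- quantile(x, c(0.25, 0.5, 0.75))   # box edges and median: 13, 16, 20
iqr <- IQR(x)                            # 20 - 13 = 7

lower_fence <- q[["25%"]] - 1.5 * iqr    # 13 - 10.5 = 2.5
upper_fence <- q[["75%"]] + 1.5 * iqr    # 20 + 10.5 = 30.5

# Outliers lie beyond the fences; whiskers reach the farthest non-outliers
outliers <- x[x < lower_fence | x > upper_fence]            # 60
whiskers <- range(x[x >= lower_fence & x <= upper_fence])   # 10 22

outliers
whiskers
```

Note that the whiskers stop at actual data values (10 and 22 here), not at the fences themselves.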

First let's create a boxplot, using similar syntax to the other plot types above.

ggplot(data, aes(y = income)) + 
  geom_boxplot() 

And we can extend boxplots with some grouping as before:

ggplot(data, aes(y = income, fill = ms)) + 
  geom_boxplot() +
  scale_fill_manual("ms",values = c("orange", "red")) 

It is also possible to extend the grouping to compare more than one grouping variable at a time:

ggplot(data, aes(sex,income, fill = ms)) + 
  geom_boxplot() +
  scale_fill_manual("ms",values = c("orange", "red")) 

And of course the same information could be grouped and displayed using a different organisation:

ggplot(data, aes(ms,age, fill = IncClass)) + 
  geom_boxplot() +
  scale_fill_manual("IncClass",values = c("orange", "red","blue")) 

We can also use faceting here:

ggplot(data, aes(ms,age, fill = IncClass)) + 
  geom_boxplot() +
  scale_fill_manual("IncClass",values = c("orange", "red","blue"))+
  facet_grid(IncClass~.)