This is a demonstration of different visualization techniques in R using the ggplot2 package.

We will use 2 datasets for this purpose: iris (built into R) and diamonds (shipped with ggplot2).

I used the Edureka video linked below as inspiration, but tried the plots on a different dataset with a little customization of my own.

Edureka Tutorial link

Let us start with basic graphs using the iris dataset.

library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.3.3
ggplot(iris, aes(y=Petal.Length, x= Petal.Width)) + geom_point()

Add color to the aesthetics. Color can be mapped to a categorical variable, and the legend is populated automatically.

ggplot(iris, aes(y=Petal.Length, x= Petal.Width, col=Species)) + geom_point()

Add shape to the aesthetics. Shape can be mapped to a categorical variable, and the legend is populated automatically.

ggplot(iris, aes(y=Petal.Length, x= Petal.Width, col=Species, shape=Species)) + geom_point()

Let us move to the diamonds dataset for intermediate-level plots.

Histogram

A histogram is used to visualize the distribution of a continuous variable.

library(dplyr)
## Warning: package 'dplyr' was built under R version 3.3.3
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
str(diamonds)
## Classes 'tbl_df', 'tbl' and 'data.frame':    53940 obs. of  10 variables:
##  $ carat  : num  0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23 ...
##  $ cut    : Ord.factor w/ 5 levels "Fair"<"Good"<..: 5 4 2 4 2 3 3 3 1 3 ...
##  $ color  : Ord.factor w/ 7 levels "D"<"E"<"F"<"G"<..: 2 2 2 6 7 7 6 5 2 5 ...
##  $ clarity: Ord.factor w/ 8 levels "I1"<"SI2"<"SI1"<..: 2 3 5 4 2 6 7 3 4 5 ...
##  $ depth  : num  61.5 59.8 56.9 62.4 63.3 62.8 62.3 61.9 65.1 59.4 ...
##  $ table  : num  55 61 65 58 58 57 57 55 61 61 ...
##  $ price  : int  326 326 327 334 335 336 336 337 337 338 ...
##  $ x      : num  3.95 3.89 4.05 4.2 4.34 3.94 3.95 4.07 3.87 4 ...
##  $ y      : num  3.98 3.84 4.07 4.23 4.35 3.96 3.98 4.11 3.78 4.05 ...
##  $ z      : num  2.43 2.31 2.31 2.63 2.75 2.48 2.47 2.53 2.49 2.39 ...

Set the number of bins, i.e. the number of bars the histogram is divided into.

ggplot(diamonds, aes(x=price)) +geom_histogram(bins=50)
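As a complement to bins, the width of each bar can be set directly with the binwidth argument of geom_histogram. The sketch below assumes a width of 500 price units, an arbitrary value picked only for illustration, and uses layer_data() to look at the bars ggplot2 computed.

```r
library(ggplot2)

# Fix the width of each bar instead of the number of bars.
# 500 is an arbitrary illustrative value, not a recommendation.
p_binwidth <- ggplot(diamonds, aes(x = price)) +
  geom_histogram(binwidth = 500)

# Inspect the computed bars: neighbouring bar centres sit one binwidth apart.
bars <- layer_data(p_binwidth)
diff(bars$x)[1:3]
```

Choosing binwidth keeps the bars comparable across datasets with different ranges, while bins keeps the number of bars fixed.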

Add color: fill sets the bar color and col sets the border color.

ggplot(diamonds, aes(x=price)) +geom_histogram(bins=50, fill='palegreen4', col='red')

Use fill as an aesthetic (here color is a column name in the diamonds dataset). fill can be mapped to categorical variables.

ggplot(diamonds, aes(x=price,fill=color)) +geom_histogram(bins=50)

If we set fill as a constant argument in geom_histogram, it overrides the value mapped in the aesthetics.

ggplot(diamonds, aes(x=price,fill=color)) +geom_histogram(bins=50,fill='palegreen4', col='red')
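To see the override in numbers rather than by eye, one option is to inspect the data each layer ends up drawing with. This is a minimal sketch; the object names p_mapped and p_constant are made up for the example.

```r
library(ggplot2)

# fill mapped to a column inside aes(): one fill colour per level of color.
p_mapped <- ggplot(diamonds, aes(x = price, fill = color)) +
  geom_histogram(bins = 50)

# Same mapping, but a constant fill passed to the geom takes precedence.
p_constant <- ggplot(diamonds, aes(x = price, fill = color)) +
  geom_histogram(bins = 50, fill = "palegreen4", col = "red")

# Count the distinct fill values used by each layer.
n_mapped <- length(unique(layer_data(p_mapped)$fill))
n_constant <- length(unique(layer_data(p_constant)$fill))
c(mapped = n_mapped, constant = n_constant)
```

The mapped version should report one fill per level of color, while the constant version should report a single fill.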

If we use position='fill' in the geom_histogram function, we get proportions instead of counts.

ggplot(diamonds, aes(x=price,fill=color)) +geom_histogram(bins=50, position='fill')
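A quick way to confirm that the bars now show proportions is to check that every stacked bar tops out at 1. The following sketch relies on layer_data() and on the ymax and x columns it returns for a histogram layer.

```r
library(ggplot2)

p_fill <- ggplot(diamonds, aes(x = price, fill = color)) +
  geom_histogram(bins = 50, position = "fill")

# For each bin (identified by its centre x), take the top of the stacked bar.
# With position = "fill" the counts are rescaled, so each top should be 1.
built <- layer_data(p_fill)
bar_tops <- tapply(built$ymax, built$x, max)
range(bar_tops, na.rm = TRUE)
```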

Bar Plot

A bar plot is used to visualize the distribution of a categorical variable.

Put a categorical variable in the aesthetics.

ggplot(diamonds, aes(x=cut)) +geom_bar()

Use fill as an aesthetic (here clarity is a column name in the diamonds dataset). fill can be mapped to categorical variables, so we can visualize 2 categorical variables in a single graph. With position = 'fill', each bar shows the share of each clarity level within that cut.

ggplot(diamonds, aes(x=cut, fill=clarity)) +geom_bar(position = 'fill')
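The shares drawn in this bar plot can also be computed as a table with base R. This is a small sketch, with counts and shares as illustrative variable names.

```r
library(ggplot2)  # provides the diamonds dataset

# Cross-tabulate cut (rows) against clarity (columns).
counts <- table(diamonds$cut, diamonds$clarity)

# Convert each row to proportions so that every cut sums to 1,
# mirroring the full-height bars drawn with position = 'fill'.
shares <- prop.table(counts, margin = 1)
round(shares, 3)
```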

Frequency Polygon

A frequency polygon is an alternative to the histogram for showing the distribution of a continuous variable.

ggplot(diamonds, aes(x=price)) +geom_freqpoly(bins = 50)

Adjust the thickness of the lines with size (ggplot2 3.4.0 and later prefer linewidth for lines).

ggplot(diamonds, aes(x=price)) +geom_freqpoly(bins = 50, size=2)

Add multiple frequency lines based on a categorical variable using col in the aesthetics.

ggplot(diamonds, aes(x=price, col=cut)) +geom_freqpoly(bins = 50, size=1)

Box Plots

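Box plots show how a continuous variable changes with respect to a categorical variable. Here carat, which is numeric, is converted to a factor so that each distinct carat value gets its own box.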

ggplot(diamonds, aes(x=factor(carat), y=price)) +geom_boxplot()

How does price (a continuous variable) change w.r.t. the cut (a categorical variable) of the diamonds?

ggplot(diamonds, aes(x=cut, y=price)) +geom_boxplot()
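The boxes summarize the quartiles of price within each cut. As a rough numerical companion to the plot, the same quartiles can be tabulated with dplyr; the column names q1, median, q3 and n below are chosen for this sketch only.

```r
library(ggplot2)  # for the diamonds dataset
library(dplyr)

# Per-cut quartiles of price: the box edges and the middle line.
price_by_cut <- diamonds %>%
  group_by(cut) %>%
  summarise(
    q1     = quantile(price, 0.25),
    median = median(price),
    q3     = quantile(price, 0.75),
    n      = n()
  )
price_by_cut
```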

Multivariate analysis: add color to bring in one more categorical variable. Use fill in the aesthetics on the color column.

ggplot(diamonds, aes(x=cut, y=price, fill=color)) +geom_boxplot()

Smooth line

A smooth line is used for continuous-to-continuous analysis. The grey band is the confidence interval around the fitted line.

ggplot(diamonds, aes(x=carat, y=price)) +geom_smooth()
## `geom_smooth()` using method = 'gam'
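The message above reports that ggplot2 picked method = 'gam' on its own. According to the geom_smooth documentation, the default is loess for fewer than 1,000 observations and mgcv::gam with formula y ~ s(x, bs = "cs") otherwise. A minimal sketch that spells the choice out explicitly, assuming the mgcv package is installed:

```r
library(ggplot2)

# Request the GAM smoother explicitly instead of relying on the default.
# The formula is the one the ggplot2 documentation gives for large datasets.
p_gam <- ggplot(diamonds, aes(x = carat, y = price)) +
  geom_smooth(method = "gam", formula = y ~ s(x, bs = "cs"))
p_gam
```

Stating the method and formula makes the plot reproducible even if the default changes between versions.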

We can remove the confidence band by using se=FALSE.

ggplot(diamonds, aes(x=carat, y=price)) +geom_smooth(se=FALSE)
## `geom_smooth()` using method = 'gam'

Add multiple lines using a categorical variable with the color aesthetic

ggplot(diamonds, aes(x=carat, y=price, color=cut)) +geom_smooth(se=FALSE)
## `geom_smooth()` using method = 'gam'

method = 'lm' fits a linear model. Use both point and smooth geometries.

ggplot(diamonds, aes(x=carat, y=price)) + geom_point() +geom_smooth(method='lm',se=FALSE)
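The straight line drawn by method='lm' is an ordinary least-squares fit of price on carat. To read off its intercept and slope, the same model can be fitted directly with base R's lm; fit is just a name used for this sketch.

```r
library(ggplot2)  # for the diamonds dataset

# Fit the same straight line that geom_smooth(method = 'lm') draws.
fit <- lm(price ~ carat, data = diamonds)

# Intercept and slope of the fitted line.
coef(fit)
```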

And we can add a categorical variable as well by using color

ggplot(diamonds, aes(x=carat, y=price, color=cut)) + geom_point() +geom_smooth(method='lm',se=FALSE)

Add one more categorical variable using the shape aesthetic.

ggplot(diamonds, aes(x=carat, y=price, color=cut, shape=clarity)) + geom_point() +geom_smooth(method='lm',se=FALSE)
## Warning: The shape palette can deal with a maximum of 6 discrete values
## because more than 6 becomes difficult to discriminate; you have 8.
## Consider specifying shapes manually if you must have them.
## Warning: Removed 5445 rows containing missing values (geom_point).

This one became too messy. The default shape palette also covers only 6 of the 8 clarity levels, so 5445 points were dropped. We should avoid putting so many variables in a single plot, so that the pattern displayed stays easy to understand.
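If all eight clarity levels really do need a shape, the warning's suggestion of specifying shapes manually can be sketched as follows. The shape codes 1 to 8 are an arbitrary choice for this example.

```r
library(ggplot2)

# Supply one shape code per clarity level (8 levels, codes 1 to 8).
p_shapes <- ggplot(diamonds, aes(x = carat, y = price, color = cut, shape = clarity)) +
  geom_point() +
  scale_shape_manual(values = 1:8)

# Every point now has a shape, so no rows are dropped for that reason.
n_missing_shape <- sum(is.na(layer_data(p_shapes)$shape))
n_missing_shape
```

This removes the warning, but the plot stays hard to read, which is why faceting is usually the better route.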

Faceting (to facet the data into groups)

The previous graph was too chaotic, so we can facet the data into groups based on a categorical variable. Now, instead of having 5 differently colored lines and sets of dots in a single graph, we will have 5 graphs, one for each category of the cut column.

ggplot(diamonds, aes(x=carat, y=price, color=cut)) + geom_point() +geom_smooth(method='lm',se=FALSE) + facet_grid(.~cut)
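When five panels in one row get too narrow, facet_wrap is an alternative that wraps the panels over several rows. A minimal sketch, where nrow = 2 is an arbitrary layout choice:

```r
library(ggplot2)

# facet_wrap() lays the panels out over rows and columns
# instead of the single row produced by facet_grid(. ~ cut).
p_wrap <- ggplot(diamonds, aes(x = carat, y = price, color = cut)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +
  facet_wrap(~ cut, nrow = 2)

# One panel per level of cut.
n_panels <- length(unique(layer_data(p_wrap)$PANEL))
n_panels
```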

Theme Layer for presentation purposes

Store the graph in an object and add labels.

obj1 <- ggplot(diamonds, aes(x=cut, y=price, fill=color)) +geom_boxplot()
obj2 <- obj1 + labs(title='My Title',x='my x axis label',y='my y axis label',fill='my legends title')
obj2

Add theme layer to give a theme to the plot

obj3 <- obj2 + theme(panel.background = element_rect(fill='palegreen4'))
obj3

The font of the plot title can be changed using the code below. hjust = 0.5 aligns the title at the center of the plot.

obj4 <- obj3 + theme(plot.title = element_text(hjust = 0.5, face='bold', colour = 'red'))
obj4

The data mapped on the y axis is price, a continuous variable, so we can use the scale_y_continuous function to change the y scale. If it were categorical, we would use scale_y_discrete instead. Here we cap the axis at 10000. Note that limits removes the rows outside the range (hence the warning below) rather than rescaling them.

obj4+scale_y_continuous(limits=c(0,10000))
## Warning: Removed 5222 rows containing non-finite values (stat_boxplot).
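Because scale limits discard data before the box statistics are computed, the boxes themselves can change. If the goal is only to zoom in, coord_cartesian is the usual alternative: it keeps all rows and changes only the visible window. A hedged sketch comparing the two, with base_plot, p_limits and p_zoom as names invented for the example:

```r
library(ggplot2)

base_plot <- ggplot(diamonds, aes(x = cut, y = price, fill = color)) +
  geom_boxplot()

# Option 1: scale limits. Rows with price outside [0, 10000] are dropped
# before the box statistics are computed.
p_limits <- base_plot + scale_y_continuous(limits = c(0, 10000))

# Option 2: coord_cartesian. All rows are kept; only the view is cropped.
p_zoom <- base_plot + coord_cartesian(ylim = c(0, 10000))

# Number of rows above the cap, i.e. the rows that option 1 discards.
n_dropped <- sum(diamonds$price > 10000)
n_dropped
```

The count printed at the end should line up with the number of rows reported in the warning above.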

Let us try theme on another plot.

g1<- ggplot(diamonds, aes(x=carat, y=price, color=cut)) + geom_point() +geom_smooth(method='lm',se=FALSE) + facet_grid(.~cut)
g1

g2 <- g1+theme(panel.background = element_rect(fill='grey'))
g2

Fill the legend with a background

g3<- g2 + theme(legend.background = element_rect(fill='grey'))
g3

Plot background

g3+theme(plot.background = element_rect(fill = 'green'))

Thus, we have used almost all the basic building blocks of the ggplot2 package and are good to go and visualize the world ourselves!

Happy reading,

Vibs!