We will use two datasets for this purpose: iris (built into R) and diamonds (shipped with ggplot2).
I used the Edureka video linked below as inspiration, but tried the plots on a different dataset and added a little customization of my own.
Edureka Tutorial link
Let us start with basic graphs using the iris dataset.
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.3.3
ggplot(iris, aes(y=Petal.Length, x= Petal.Width)) + geom_point()
Add color to the aesthetics. It can be mapped to a categorical variable, and the legend is populated automatically.
ggplot(iris, aes(y=Petal.Length, x= Petal.Width, col=Species)) + geom_point()
Add shape to the aesthetics. It can be mapped to a categorical variable, and the legend is populated automatically.
ggplot(iris, aes(y=Petal.Length, x= Petal.Width, col=Species, shape=Species)) + geom_point()
Let us move to the diamonds dataset for the intermediate-level plots.
A histogram is used to visualize the distribution of a continuous variable.
library(dplyr)
## Warning: package 'dplyr' was built under R version 3.3.3
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
str(diamonds)
## Classes 'tbl_df', 'tbl' and 'data.frame': 53940 obs. of 10 variables:
## $ carat : num 0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23 ...
## $ cut : Ord.factor w/ 5 levels "Fair"<"Good"<..: 5 4 2 4 2 3 3 3 1 3 ...
## $ color : Ord.factor w/ 7 levels "D"<"E"<"F"<"G"<..: 2 2 2 6 7 7 6 5 2 5 ...
## $ clarity: Ord.factor w/ 8 levels "I1"<"SI2"<"SI1"<..: 2 3 5 4 2 6 7 3 4 5 ...
## $ depth : num 61.5 59.8 56.9 62.4 63.3 62.8 62.3 61.9 65.1 59.4 ...
## $ table : num 55 61 65 58 58 57 57 55 61 61 ...
## $ price : int 326 326 327 334 335 336 336 337 337 338 ...
## $ x : num 3.95 3.89 4.05 4.2 4.34 3.94 3.95 4.07 3.87 4 ...
## $ y : num 3.98 3.84 4.07 4.23 4.35 3.96 3.98 4.11 3.78 4.05 ...
## $ z : num 2.43 2.31 2.31 2.63 2.75 2.48 2.47 2.53 2.49 2.39 ...
Set the number of bins, i.e. the number of bars in the histogram.
ggplot(diamonds, aes(x=price)) +geom_histogram(bins=50)
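Instead of a bin count, the bin size can also be fixed directly with binwidth. A minimal sketch, assuming a width of 500 price units (an arbitrary value picked for illustration; p_binwidth is just a name chosen here):

```r
library(ggplot2)

# binwidth sets the width of each bar in units of the x variable (price here);
# 500 is an arbitrary choice for illustration, not a recommended value
p_binwidth <- ggplot(diamonds, aes(x = price)) +
  geom_histogram(binwidth = 500)
p_binwidth
```

A fixed width is often easier to explain than a bin count, because each bar then covers a known price range.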
Add color: fill sets the bar color and col sets the border color.
ggplot(diamonds, aes(x=price)) +geom_histogram(bins=50, fill='palegreen4', col='red')
Use fill as an aesthetic (here color is a column name in the diamonds dataset). fill can be mapped to categorical variables.
ggplot(diamonds, aes(x=price,fill=color)) +geom_histogram(bins=50)
If we set fill as a fixed argument in geom_histogram, it overrides the aesthetic mapping.
ggplot(diamonds, aes(x=price,fill=color)) +geom_histogram(bins=50,fill='palegreen4', col='red')
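One way to confirm the override is to look at the data ggplot2 computes for the layer. This is only a sketch; p_fixed is a name chosen here, and ggplot_build() is used purely for inspection:

```r
library(ggplot2)

p_fixed <- ggplot(diamonds, aes(x = price, fill = color)) +
  geom_histogram(bins = 50, fill = 'palegreen4', col = 'red')

# Every bar in the computed layer data carries the fixed fill value,
# so the mapping to the color column has no visible effect
unique(ggplot_build(p_fixed)$data[[1]]$fill)
```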
If we use position='fill' in the geom_histogram function, each bin shows proportions instead of counts.
ggplot(diamonds, aes(x=price,fill=color)) +geom_histogram(bins=50, position='fill')
A bar plot is used to visualize the distribution of a categorical variable.
Put a categorical variable in the aesthetic.
ggplot(diamonds, aes(x=cut)) +geom_bar()
Use fill as an aesthetic (here clarity is a column name in the diamonds dataset). fill can be mapped to categorical variables, so we can visualize 2 categorical variables in a single graph. With position = 'fill', each bar shows the proportion of each clarity within a cut.
ggplot(diamonds, aes(x=cut, fill=clarity)) +geom_bar(position = 'fill')
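To see the numbers behind the filled bars, the same proportions can be computed with dplyr. A minimal sketch; clarity_share and prop are names chosen here for illustration:

```r
library(ggplot2)
library(dplyr)

# Share of each clarity grade within every cut; these shares are what the
# segment heights in the position = 'fill' bar chart represent
clarity_share <- diamonds %>%
  count(cut, clarity) %>%
  group_by(cut) %>%
  mutate(prop = n / sum(n)) %>%
  ungroup()

head(clarity_share)
```

Within each cut the prop values add up to 1, which is why every bar in the plot has the same height.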
A frequency polygon is an alternative to the histogram for the distribution of a continuous variable.
ggplot(diamonds, aes(x=price)) +geom_freqpoly(bins = 50)
Play with the thickness of the lines using size (ggplot2 3.4.0 and later prefer linewidth for this).
ggplot(diamonds, aes(x=price)) +geom_freqpoly(bins = 50, size=2)
Add multiple frequency lines based on a categorical variable using col in the aesthetics.
ggplot(diamonds, aes(x=price, col=cut)) +geom_freqpoly(bins = 50, size=1)
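Because the cuts contain very different numbers of diamonds, the raw counts make the smaller groups hard to compare. One option is to plot densities instead of counts. This sketch assumes ggplot2 3.3.0 or later, where after_stat() is available; p_density is a name chosen here:

```r
library(ggplot2)

# after_stat(density) rescales the counts of each group to densities, so
# cuts of very different sizes end up on a comparable scale.
# Assumption: ggplot2 >= 3.3.0 (older versions spell this ..density..)
p_density <- ggplot(diamonds, aes(x = price, y = after_stat(density), col = cut)) +
  geom_freqpoly(bins = 50)
p_density
```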
Box plots show how a continuous variable changes with respect to a categorical variable.
ggplot(diamonds, aes(x=factor(carat), y=price)) +geom_boxplot()
How does price (a continuous variable) change with respect to cut (a categorical variable) of the diamonds?
ggplot(diamonds, aes(x=cut, y=price)) +geom_boxplot()
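The box in each box plot spans the first to third quartile, with the median as the line inside. The same numbers can be tabulated with dplyr as a cross-check. A minimal sketch; price_by_cut and its column names are chosen here for illustration:

```r
library(ggplot2)
library(dplyr)

# Quartiles of price for each cut: q1 and q3 are the box edges,
# med is the line drawn inside the box
price_by_cut <- diamonds %>%
  group_by(cut) %>%
  summarise(
    q1  = quantile(price, 0.25),
    med = median(price),
    q3  = quantile(price, 0.75)
  )
price_by_cut
```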
Multivariate analysis: bring in one more categorical variable by mapping fill to the color column in the aesthetic.
ggplot(diamonds, aes(x=cut, y=price, fill=color)) +geom_boxplot()
A smooth line is used for continuous-to-continuous variable analysis. The grey area is the confidence interval around the fitted line.
ggplot(diamonds, aes(x=carat, y=price)) +geom_smooth()
## `geom_smooth()` using method = 'gam'
We can remove the confidence band by using se=FALSE.
ggplot(diamonds, aes(x=carat, y=price)) +geom_smooth(se=FALSE)
## `geom_smooth()` using method = 'gam'
Add multiple lines using a categorical variable with the color aesthetic
ggplot(diamonds, aes(x=carat, y=price, color=cut)) +geom_smooth(se=FALSE)
## `geom_smooth()` using method = 'gam'
method='lm' fits a linear model. Here we use both the point and smooth geometries.
ggplot(diamonds, aes(x=carat, y=price)) + geom_point() +geom_smooth(method='lm',se=FALSE)
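The straight line drawn by geom_smooth(method='lm') corresponds to a linear regression of y on x. To read off its intercept and slope, the same model can be fitted directly with lm(). A minimal sketch; fit is a name chosen here:

```r
library(ggplot2)

# Same model as the smoothing line: price as a linear function of carat
fit <- lm(price ~ carat, data = diamonds)

# Intercept and slope of the fitted line
coef(fit)
```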
And we can add a categorical variable as well by using color
ggplot(diamonds, aes(x=carat, y=price, color=cut)) + geom_point() +geom_smooth(method='lm',se=FALSE)
Add 1 more categorical variable using the shape aesthetic.
ggplot(diamonds, aes(x=carat, y=price, color=cut, shape=clarity)) + geom_point() +geom_smooth(method='lm',se=FALSE)
## Warning: The shape palette can deal with a maximum of 6 discrete values
## because more than 6 becomes difficult to discriminate; you have 8.
## Consider specifying shapes manually if you must have them.
## Warning: Removed 5445 rows containing missing values (geom_point).
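The 5445 removed rows are the diamonds in the two clarity levels that did not get a shape (VVS1 and IF), since the default palette only covers the first 6 levels. As the warning suggests, the shapes can be specified manually with scale_shape_manual. A minimal sketch; the shape codes below are an arbitrary choice for illustration, and the smoothing layer is left out to keep it short:

```r
library(ggplot2)

# Eight shape codes, one per clarity level, assigned in level order;
# the specific codes are an arbitrary choice for illustration
clarity_shapes <- c(0, 1, 2, 3, 4, 5, 6, 8)

p_shapes <- ggplot(diamonds, aes(x = carat, y = price, color = cut, shape = clarity)) +
  geom_point() +
  scale_shape_manual(values = clarity_shapes)
p_shapes
```

This removes the warning, but the plot is still crowded with 8 shapes and 5 colors.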
This one became too messy. It is better to avoid putting so many variables in a single plot, so that the pattern stays easy to read.
The previous graph was too chaotic, so we can facet the data into groups based on a categorical variable. Instead of 5 differently colored lines and sets of dots in a single graph, we get 5 graphs, one for each category of the cut column.
ggplot(diamonds, aes(x=carat, y=price, color=cut)) + geom_point() +geom_smooth(method='lm',se=FALSE) + facet_grid(.~cut)
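facet_grid(.~cut) places all panels in one row. If that gets too narrow, facet_wrap can wrap the panels over several rows. A minimal sketch, assuming 3 columns (an arbitrary choice); p_wrap is a name chosen here:

```r
library(ggplot2)

# Same plot as above, but the 5 panels wrap onto rows of at most 3 panels
p_wrap <- ggplot(diamonds, aes(x = carat, y = price, color = cut)) +
  geom_point() +
  geom_smooth(method = 'lm', se = FALSE) +
  facet_wrap(~ cut, ncol = 3)
p_wrap
```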
Store the graph in an object and add labels.
obj1 <- ggplot(diamonds, aes(x=cut, y=price, fill=color)) +geom_boxplot()
obj2 <- obj1 + labs(title='My Title',x='my x axis label',y='my y axis label',fill='my legends title')
obj2
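labs() also accepts a subtitle and a caption. A minimal sketch with placeholder texts; obj_labeled is a name chosen here, and the base plot is rebuilt so the snippet runs on its own:

```r
library(ggplot2)

# Same box plot as obj1, with a subtitle and a caption added to the labels;
# all label texts are placeholders
obj_labeled <- ggplot(diamonds, aes(x = cut, y = price, fill = color)) +
  geom_boxplot() +
  labs(
    title = 'My Title',
    subtitle = 'my subtitle',
    caption = 'my caption',
    x = 'my x axis label',
    y = 'my y axis label',
    fill = 'my legends title'
  )
obj_labeled
```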
Add a theme layer to style the plot.
obj3 <- obj2 + theme(panel.background = element_rect(fill='palegreen4'))
obj3
The font of the plot title can be changed using the code below; hjust = 0.5 aligns the title at the center of the plot.
obj4 <- obj3 + theme(plot.title = element_text(hjust = 0.5, face='bold', colour = 'red'))
obj4
The data mapped to the y axis is price, which is a continuous variable, so we can use the scale_y_continuous function to change the y scale. Similarly, if it were categorical, we would use scale_y_discrete. Here we want to cut off the outlier values by setting a maximum of 10000. Note that values outside the limits are removed before the box plots are computed, as the warning below shows.
obj4+scale_y_continuous(limits=c(0,10000))
## Warning: Removed 5222 rows containing non-finite values (stat_boxplot).
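Since the scale limits drop the rows above 10000 before the statistics are computed, the boxes change as well. If the goal is only to zoom in, coord_cartesian(ylim = ...) keeps all the data. A minimal sketch; box_base and box_zoomed are names chosen here, and the labels and theme are left out to keep it short:

```r
library(ggplot2)

box_base <- ggplot(diamonds, aes(x = cut, y = price, fill = color)) +
  geom_boxplot()

# Zoom the visible y range to 0..10000 without discarding any rows,
# so medians and quartiles are still computed from the full data
box_zoomed <- box_base + coord_cartesian(ylim = c(0, 10000))
box_zoomed
```

Which one to use depends on the intent: scale limits filter the data, while coord_cartesian only changes the view.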
Let us try theme on another plot.
g1<- ggplot(diamonds, aes(x=carat, y=price, color=cut)) + geom_point() +geom_smooth(method='lm',se=FALSE) + facet_grid(.~cut)
g1
g2 <- g1+theme(panel.background = element_rect(fill='grey'))
g2
Give the legend a background fill.
g3<- g2 + theme(legend.background = element_rect(fill='grey'))
g3
Set the plot background.
g3+theme(plot.background = element_rect(fill = 'green'))
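The three theme steps above can also be written as a single theme() call. A minimal sketch that rebuilds the plot so it runs on its own; p_themed is a name chosen here:

```r
library(ggplot2)

# Panel, legend and plot backgrounds set in one theme() call
p_themed <- ggplot(diamonds, aes(x = carat, y = price, color = cut)) +
  geom_point() +
  geom_smooth(method = 'lm', se = FALSE) +
  facet_grid(. ~ cut) +
  theme(
    panel.background = element_rect(fill = 'grey'),
    legend.background = element_rect(fill = 'grey'),
    plot.background = element_rect(fill = 'green')
  )
p_themed
```

Adding theme() layers one at a time, as done above, is handy while experimenting; a single call is easier to reuse once the look is settled.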
Thus, we have used almost all the basic building blocks of the ggplot2 package and are good to go and visualize the world ourselves!
Happy reading,
Vibs!