Graphical Data Analysis (GDA)
Understanding the philosophy of ggplot2: Bar plot, Pie chart, Histogram, Boxplot, Scatter plot, Regression plots
ggplot2 is a free, open-source, and easy-to-use data visualization package for R, written by Hadley Wickham and widely used in the R community. It implements the Grammar of Graphics, in which a plot is built up from layers.
Building blocks (layers) of the grammar of graphics
Data: the data set itself.
Aesthetics: the data is mapped onto aesthetic attributes such as x-axis, y-axis, color, fill, size, labels, alpha, shape, line width, and line type.
Geometries: how the data is displayed, using point, line, histogram, bar, or boxplot.
Facets: display subsets of the data in columns and rows.
Statistics: binning, smoothing, descriptive, intermediate.
Coordinates: the space between data and display, using Cartesian, fixed, polar, or limits.
Themes: non-data ink.
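To make the layers concrete, here is a minimal sketch that uses one element from each building block in a single plot. The choice of columns, the y-axis limits, and the object name p are illustrative assumptions, not part of the original tutorial.

```r
library(ggplot2)

# Data + aesthetics: map two iris columns to x and y, and Species to colour
p <- ggplot(data = datasets::iris,
            aes(x = Petal.Length, y = Petal.Width, colour = Species)) +
  geom_point() +                                            # geometry: one point per flower
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE) + # statistics: per-species linear fit
  facet_wrap(~ Species) +                                   # facets: one panel per species
  coord_cartesian(ylim = c(0, 3)) +                         # coordinates: visible y range (assumed limits)
  theme_minimal() +                                         # theme: non-data appearance
  labs(title = "Petal length vs. petal width by species")

print(p)
```

Each `+` adds one layer, so any line after the first can be removed or swapped without touching the others.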
The Iris dataset contains four features (length and width of sepals and petals) for 50 samples from each of three Iris species (Iris setosa, Iris virginica, and Iris versicolor). These measurements were used to create a linear discriminant model to classify the species.
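As a rough illustration of such a linear discriminant model, the sketch below fits one with lda() from the MASS package. This is an assumption about tooling, not a reconstruction of the original analysis, and the names flowers, fit, pred, confusion, and accuracy are hypothetical.

```r
library(MASS)  # provides lda(); MASS is a recommended package bundled with most R installs

flowers <- datasets::iris  # hypothetical name for a fresh copy of the built-in data

# Fit a linear discriminant model predicting Species from the four measurements
fit <- lda(Species ~ ., data = flowers)

# Resubstitution predictions: evaluated on the training data, so the result is optimistic
pred <- predict(fit)$class
confusion <- table(actual = flowers$Species, predicted = pred)
accuracy <- mean(pred == flowers$Species)

print(confusion)
print(accuracy)
```

A held-out test set or cross-validation would give a fairer estimate of classification accuracy than this in-sample check.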
summary(iris)
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100
## 1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300
## Median :5.800 Median :3.000 Median :4.350 Median :1.300
## Mean :5.843 Mean :3.057 Mean :3.758 Mean :1.199
## 3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800
## Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500
## Species
## setosa :50
## versicolor:50
## virginica :50
##
##
##
str(iris)
## 'data.frame': 150 obs. of 5 variables:
## $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
## $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
## $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
## $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
## $ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
Data Layer: In the data layer, we specify the source of the information to be visualized, i.e. the iris dataset from R's built-in datasets package.
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 4.2.3
#Data Layer
ggplot(data = iris) + labs(title ="iris Data Plot")
# Aesthetic Layer
ggplot(data = iris, aes(x = Sepal.Length, y = Sepal.Width , col = Petal.Length ))+labs(title = "iris Data Plot")
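A plot that has only data and aesthetics contains no geometry, so it renders as empty axes. The sketch below stores such a plot and then extends it, to show that layers are added one at a time with +. The object names base_plot and point_plot and the choice of petal columns are illustrative assumptions.

```r
library(ggplot2)

# Data and aesthetic layers only: no geometry yet, so this draws empty axes
base_plot <- ggplot(data = datasets::iris,
                    aes(x = Petal.Length, y = Petal.Width, colour = Species))

# Adding a geometry with + returns a new plot; base_plot itself is unchanged
point_plot <- base_plot + geom_point()

print(length(base_plot$layers))   # number of geometry layers in the base plot
print(length(point_plot$layers))  # number of geometry layers after adding points
print(point_plot)
```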
Geometric Layer: In the geometric layer, we control the essential elements of how the data is displayed, using point, line, histogram, bar, or boxplot.
# Geometric layer
ggplot(data = iris, aes(x =Sepal.Length , y = Sepal.Width, col =Petal.Length )) +
geom_point() +
labs(title = "Sepal.Length vs Sepal.Width", x = "Sepal.Length", y = "Sepal.Width")
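The title of this section lists regression plots. One possible sketch adds a least-squares line to a scatter plot with geom_smooth(method = "lm") and fits the same model with lm() to inspect the coefficients. The choice of petal columns and the names flowers, reg_plot, and fit are assumptions made for illustration.

```r
library(ggplot2)

flowers <- datasets::iris  # hypothetical name for a fresh copy of the built-in data

# Scatter plot with a least-squares line and its confidence band
reg_plot <- ggplot(flowers, aes(x = Petal.Length, y = Petal.Width)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x) +
  labs(title = "Petal.Width regressed on Petal.Length",
       x = "Petal.Length", y = "Petal.Width")
print(reg_plot)

# The same model fitted directly, to read off intercept and slope
fit <- lm(Petal.Width ~ Petal.Length, data = flowers)
print(coef(fit))
```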
# Adding size: map Petal.Length to point size
ggplot(data = iris, aes(x = Sepal.Length, y =Sepal.Width , size= Petal.Length)) +
geom_point() +
labs(title = "Sepal.Length vs Sepal.Width", x = "Sepal.Length", y = "Sepal.Width")
# Adding shape and color
ggplot(data = iris, aes(x = Sepal.Length, y = Sepal.Width, col = factor(Petal.Length), shape = factor(Species))) +
geom_point() +
labs(title = "Sepal.Length vs Sepal.Width", x = "Sepal.Length", y = "Sepal.Width")
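Mapping color to factor(Petal.Length) produces one legend entry per distinct petal length, which is hard to read. A possible alternative, sketched here as a suggestion rather than as the tutorial's own method, maps both color and shape to Species so that the legend has only three entries. The names flowers and species_plot are hypothetical.

```r
library(ggplot2)

flowers <- datasets::iris  # hypothetical name for a fresh copy of the built-in data

# Colour and shape both encode Species, so the two legends merge into one
species_plot <- ggplot(flowers,
                       aes(x = Sepal.Length, y = Sepal.Width,
                           colour = Species, shape = Species)) +
  geom_point(size = 2) +
  labs(title = "Sepal.Length vs Sepal.Width by Species",
       x = "Sepal.Length", y = "Sepal.Width")

print(species_plot)
```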
# A scatter plot with a categorical x-axis
# factor() inside aes() treats Sepal.Length as categorical without modifying iris
ggplot(iris, aes(x = factor(Sepal.Length), y = Sepal.Width)) +
geom_point()
# Histogram plot
# Sepal.Width ranges from 2.0 to 4.4, so the bin width must be well below that span
ggplot(data = iris, aes(x = Sepal.Width)) +
geom_histogram(binwidth = 0.2, color = "black", fill = "lightblue") +
labs(title = "Histogram of Sepal.Width", x = "Sepal.Width")
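The bin width should be chosen relative to the range of the variable. Sepal.Width spans 2.0 to 4.4 according to the summary above, so a bin width larger than that span collapses the histogram into a single bar. The sketch below estimates how many bins a given width yields; the value 0.2 and the names flowers, bin_width, approx_bins, and hist_plot are illustrative assumptions.

```r
library(ggplot2)

flowers <- datasets::iris  # hypothetical name for a fresh copy of the built-in data

width_range <- range(flowers$Sepal.Width)  # minimum and maximum of the variable
bin_width <- 0.2                           # assumed bin width, chosen for illustration

# Approximate number of bins needed to cover the range at this width
approx_bins <- round(diff(width_range) / bin_width)
print(approx_bins)

hist_plot <- ggplot(flowers, aes(x = Sepal.Width)) +
  geom_histogram(binwidth = bin_width, color = "black", fill = "lightblue") +
  labs(title = "Histogram of Sepal.Width", x = "Sepal.Width")
print(hist_plot)
```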
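Bar plot: geom_bar()
Histogram: geom_histogram()
Boxplot: geom_boxplot()
Scatter plot: geom_point()
# Bar plot: number of observations at each Sepal.Length value
ggplot(data = iris, aes(x = factor(Sepal.Length), fill = factor(Sepal.Length))) +
geom_bar(stat = "count")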
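geom_bar() with stat = "count" draws one bar per category, with height equal to the number of rows in that category. A simpler case to check by hand is a bar plot of Species, sketched below alongside the same counts computed with table(). The names flowers, species_counts, and bar_plot are hypothetical.

```r
library(ggplot2)

flowers <- datasets::iris  # hypothetical name for a fresh copy of the built-in data

# The counts that geom_bar() will display, computed directly
species_counts <- table(flowers$Species)
print(species_counts)

# One bar per species; bar height is the number of rows for that species
bar_plot <- ggplot(flowers, aes(x = Species, fill = Species)) +
  geom_bar() +
  labs(title = "Number of samples per species", x = "Species", y = "Count")
print(bar_plot)
```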
For comparison, the same kind of bar plot of carb in the mtcars dataset shows that the majority of cars (62%) have 2 or 4 carburetors, 1 carburetor is the next most common (22%), and 3, 6, or 8 carburetors are very uncommon.
Boxplot: A boxplot shows the distribution of a quantitative response variable across the levels of a categorical explanatory variable. The example below displays the distribution of Sepal.Width for each observed value of Sepal.Length.
# x is Sepal.Length (as categories) and y is Sepal.Width, so the axis labels must match
bx <- ggplot(data = iris, aes(x = factor(Sepal.Length), y = Sepal.Width)) +
geom_boxplot(fill = "red") +
ggtitle("iris dataset") +
ylab("Sepal width") +
xlab("Sepal length")
bx
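Species is the natural categorical explanatory variable in this dataset, so a boxplot grouped by species is a useful companion view. The sketch below draws it and computes the per-species medians that the center line of each box represents. The names flowers, box_plot, and medians are hypothetical.

```r
library(ggplot2)

flowers <- datasets::iris  # hypothetical name for a fresh copy of the built-in data

# One box per species, showing the distribution of Sepal.Width
box_plot <- ggplot(flowers, aes(x = Species, y = Sepal.Width)) +
  geom_boxplot(fill = "lightblue") +
  labs(title = "Sepal.Width by species", x = "Species", y = "Sepal.Width")
print(box_plot)

# The median of each group corresponds to the center line of its box
medians <- tapply(flowers$Sepal.Width, flowers$Species, median)
print(medians)
```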
# Pie chart: share of observations at each Petal.Width value
width_counts <- table(iris$Petal.Width)
share <- round(width_counts / sum(width_counts) * 100)
# Each label is the Petal.Width value followed by its percentage share
data.labels <- paste0(names(width_counts), " ", share, "%")
pie(width_counts, labels = data.labels, clockwise = TRUE, col = heat.colors(length(data.labels)), main = "Frequency of Petal.Width")
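The pie() call above comes from base R graphics. ggplot2 has no dedicated pie geometry, but a common approach is to draw a single stacked bar and switch to polar coordinates, which ties in with the Coordinates building block. The sketch below is one way to do this; the names flowers, width_counts, and pie_plot are hypothetical.

```r
library(ggplot2)

flowers <- datasets::iris  # hypothetical name for a fresh copy of the built-in data

# Frequency table as a data frame with columns Petal.Width (factor) and Freq
width_counts <- as.data.frame(table(Petal.Width = flowers$Petal.Width))

# A single stacked bar, bent into a circle by mapping y to the angle
pie_plot <- ggplot(width_counts, aes(x = "", y = Freq, fill = Petal.Width)) +
  geom_col(width = 1) +
  coord_polar(theta = "y") +
  theme_void() +
  labs(title = "Frequency of Petal.Width", fill = "Petal.Width")

print(pie_plot)
```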