## Graphical Data Analysis (GDA)

Understanding the philosophy of ggplot2: Bar plot, Pie chart, Histogram, Boxplot, Scatter plot, Regression plots

ggplot2 is a free, open-source data visualization package for R that implements the Grammar of Graphics. It was written by Hadley Wickham and is one of the most widely used plotting packages in R.

Building blocks (layers) of the grammar of graphics:

- Data: the data set itself
- Aesthetics: the attributes the data are mapped onto, such as x-axis, y-axis, color, fill, size, labels, alpha, shape, line width, line type
- Geometries: how the data are displayed, using points, lines, histograms, bars, boxplots
- Facets: display subsets of the data in columns and rows
- Statistics: binning, smoothing, descriptive and inferential summaries
- Coordinates: the space onto which the data are drawn, such as Cartesian, fixed, polar, and axis limits
- Themes: all non-data ink
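As a rough illustration of how these building blocks combine, the sketch below adds one component per line with `+`. The choice of variables (`wt`, `mpg`, `cyl`), the y-axis limits, and the theme are assumptions made for the example, not something the grammar prescribes.

```r
library(ggplot2)

# Build one plot layer by layer; each `+` adds one grammar component.
p <- ggplot(data = mtcars,                                   # Data
            aes(x = wt, y = mpg)) +                          # Aesthetics
  geom_point() +                                             # Geometries
  facet_wrap(~ cyl) +                                        # Facets: one panel per cylinder count
  stat_smooth(method = "lm", formula = y ~ x, se = FALSE) +  # Statistics: fitted line per panel
  coord_cartesian(ylim = c(10, 35)) +                        # Coordinates: zoom the y-axis
  theme_minimal()                                            # Themes: non-data ink

p
```

Removing any line after `geom_point()` still gives a valid plot, which is the practical benefit of the layered design.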
Dataset used: mtcars (Motor Trend Car Road Tests) comprises fuel consumption and 10 aspects of automobile design and performance for 32 automobiles. It ships with base R in the datasets package, so no additional package is required to load it.
options(repos = list(CRAN="http://cran.rstudio.com/"))
# mtcars is part of base R (datasets package), so nothing needs to be installed to use it.
# ggplot2 is required for the plots further down; install it once if it is missing:
#install.packages("ggplot2")
# Summary of the dataset
summary(mtcars)
## mpg cyl disp hp
## Min. :10.40 Min. :4.000 Min. : 71.1 Min. : 52.0
## 1st Qu.:15.43 1st Qu.:4.000 1st Qu.:120.8 1st Qu.: 96.5
## Median :19.20 Median :6.000 Median :196.3 Median :123.0
## Mean :20.09 Mean :6.188 Mean :230.7 Mean :146.7
## 3rd Qu.:22.80 3rd Qu.:8.000 3rd Qu.:326.0 3rd Qu.:180.0
## Max. :33.90 Max. :8.000 Max. :472.0 Max. :335.0
## drat wt qsec vs
## Min. :2.760 Min. :1.513 Min. :14.50 Min. :0.0000
## 1st Qu.:3.080 1st Qu.:2.581 1st Qu.:16.89 1st Qu.:0.0000
## Median :3.695 Median :3.325 Median :17.71 Median :0.0000
## Mean :3.597 Mean :3.217 Mean :17.85 Mean :0.4375
## 3rd Qu.:3.920 3rd Qu.:3.610 3rd Qu.:18.90 3rd Qu.:1.0000
## Max. :4.930 Max. :5.424 Max. :22.90 Max. :1.0000
## am gear carb
## Min. :0.0000 Min. :3.000 Min. :1.000
## 1st Qu.:0.0000 1st Qu.:3.000 1st Qu.:2.000
## Median :0.0000 Median :4.000 Median :2.000
## Mean :0.4062 Mean :3.688 Mean :2.812
## 3rd Qu.:1.0000 3rd Qu.:4.000 3rd Qu.:4.000
## Max. :1.0000 Max. :5.000 Max. :8.000
# Explore the mtcars data frame with str()
str(mtcars)
## 'data.frame': 32 obs. of 11 variables:
## $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
## $ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
## $ disp: num 160 160 108 258 360 ...
## $ hp : num 110 110 93 110 175 105 245 62 95 123 ...
## $ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
## $ wt : num 2.62 2.88 2.32 3.21 3.44 ...
## $ qsec: num 16.5 17 18.6 19.4 17 ...
## $ vs : num 0 0 1 1 0 1 0 1 1 1 ...
## $ am : num 1 1 1 0 0 0 0 0 0 0 ...
## $ gear: num 4 4 4 3 3 3 3 4 4 4 ...
## $ carb: num 4 4 1 1 2 1 4 2 2 4 ...
Data Layer: The data layer names the source of the information to be visualized, here the mtcars dataset. With no aesthetics or geometries yet, the result is an empty panel with a title.
library(ggplot2)
#Data Layer
ggplot(data = mtcars) + labs(title ="MTCars Data Plot")
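Aesthetic Layer: Map variables in the dataset to aesthetics. Here hp is mapped to the x-axis, mpg to the y-axis, and disp to color. The axes now appear, but nothing is drawn until a geometry is added.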
# Aesthetic Layer
ggplot(data = mtcars, aes(x = hp, y = mpg, col = disp))+labs(title = "MTCars Data Plot")
Geometric layer: The geometric layer controls how the data are displayed, for example as points, lines, a histogram, bars, or a boxplot.
# Geometric layer
ggplot(data = mtcars, aes(x = hp, y = mpg, col = disp)) +
geom_point() +
labs(title = "Miles per Gallon vs Horsepower", x = "Horsepower", y = "Miles per Gallon")
Adding size, color, and shape, and then plotting a histogram
# Adding size
ggplot(data = mtcars, aes(x = hp, y = mpg, size = disp)) +
geom_point() +
labs(title = "Miles per Gallon vs Horsepower", x = "Horsepower", y = "Miles per Gallon")
# Adding shape and color
ggplot(data = mtcars, aes(x = hp, y = mpg, col = factor(cyl), shape = factor(am))) +
geom_point() +
labs(title = "Miles per Gallon vs Horsepower", x = "Horsepower", y = "Miles per Gallon")
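The plots above map variables to size, color, and shape inside `aes()`. For contrast, the sketch below sets a fixed color outside `aes()`. The color name and point size are arbitrary choices for illustration.

```r
library(ggplot2)

# Mapped: colour varies with the data, so ggplot2 builds a scale and a legend.
p_mapped <- ggplot(mtcars, aes(x = hp, y = mpg)) +
  geom_point(aes(colour = factor(cyl)), size = 3)

# Set: every point gets the same fixed colour, so no legend is drawn.
p_set <- ggplot(mtcars, aes(x = hp, y = mpg)) +
  geom_point(colour = "steelblue", size = 3)

p_mapped
p_set
```

A common slip is writing `aes(colour = "steelblue")`, which treats the string as a one-level variable and adds a legend for it rather than coloring the points steel blue.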
Treating cyl (the number of cylinders) as categorical
# A scatter plot
mtcars$cyl <- factor(mtcars$cyl)
ggplot(mtcars, aes(x = factor(cyl), y = mpg)) +
geom_point()
# Histogram plot
ggplot(data = mtcars, aes(x = hp)) +
geom_histogram(binwidth = 5,color="black", fill="lightblue") +
labs(title = "Histogram of Horsepower", x = "Horsepower", y = "Count")
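The histogram relies on the binning statistic, so its shape depends on the bin width. As a sketch, the code below redraws the plot with a wider bin (25 is an arbitrary choice, as is the `boundary` value) and uses base R's `hist()` with `plot = FALSE` to show the binning step without drawing.

```r
library(ggplot2)

# Wider bins give a coarser, smoother picture of the same variable.
p_wide <- ggplot(mtcars, aes(x = hp)) +
  geom_histogram(binwidth = 25, boundary = 50, color = "black", fill = "lightblue") +
  labs(title = "Histogram of Horsepower (binwidth = 25)", x = "Horsepower", y = "Count")
p_wide

# The binning step on its own in base R: counts per 25-hp interval from 50 to 350
breaks <- seq(50, 350, by = 25)
bin_counts <- hist(mtcars$hp, breaks = breaks, plot = FALSE)$counts
bin_counts
```

Whatever the bin width, the counts always add up to the 32 cars in the dataset; only how they are grouped changes.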
- Bar plot: geom_bar()
- Pie chart: pie() (base R)
- Boxplot: geom_boxplot()
- Scatter plot: geom_point()
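# Bar plot of the number of cars per cylinder count; fill is made categorical explicitly
ggplot(data = mtcars, aes(x = as.factor(cyl), fill = as.factor(cyl))) +
geom_bar(stat = "count")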
Fill colors for charts such as bar plots and pie charts can be changed with:

- scale_fill_manual(): use custom colors
- scale_fill_brewer(): use color palettes from the RColorBrewer package
- scale_fill_grey(): use grey color palettes

Create a bar graph that shows the number of cars with each gear count in mtcars, and write a brief summary.
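# Count the cars for each number of gears
gear.type = table(mtcars$gear)
# The bars are labelled on the x-axis, so no legend is needed
# (the earlier legend showed cylinder counts, which do not correspond to these bars)
barplot(gear.type, main = "No. of Gears Frequency", xlab = "No. of Gears",
        ylab = "Frequency of Cars", names.arg = names(gear.type),
        col = c("blue", "green", "yellow"))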
Summary: Cars with 3 gears (15) and 4 gears (12) are more common than cars with 5 gears; only five cars have 5 gears.
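The same bar graph can also be sketched in ggplot2, which makes it easy to try the fill scales listed earlier. The palette name "Set2", the specific colors, and the object names are illustrative assumptions.

```r
library(ggplot2)

# Counts per gear value, for reference
gear_counts <- table(mtcars$gear)
gear_counts

# Base plot without a fill scale, so that different scales can be added in turn
base_bar <- ggplot(mtcars, aes(x = factor(gear), fill = factor(gear))) +
  geom_bar() +
  labs(title = "Number of Cars by Gear Count",
       x = "Number of gears", y = "Number of cars", fill = "Gears")

bar_brewer <- base_bar + scale_fill_brewer(palette = "Set2")   # RColorBrewer palette
bar_manual <- base_bar +
  scale_fill_manual(values = c("3" = "blue", "4" = "green", "5" = "yellow"))  # custom colors
bar_grey   <- base_bar + scale_fill_grey()                      # grey palette

bar_brewer
bar_manual
bar_grey
```

Naming the values in `scale_fill_manual()` ties each color to a gear level, so the assignment does not depend on the order of the levels.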
Pie chart
A pie chart showing the proportion of cars in the mtcars data set with each carb (number of carburetors) value, followed by a brief summary.
carb = table(mtcars$carb)
data.labels = names(carb)
share = round(carb/sum(carb)*100)
data.labels = paste(data.labels, share)
data.labels = paste(data.labels,"%",sep="")
pie(carb,labels = data.labels,clockwise=TRUE, col=heat.colors(length(data.labels)), main="Frequency of Carb value")
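ggplot2 has no dedicated pie geometry, but a similar chart can be sketched as a single stacked bar drawn in polar coordinates. The data frame name `carb_df` and the styling choices below are assumptions made for this example.

```r
library(ggplot2)

# One row per carb value with its count and rounded percentage share
carb_df <- as.data.frame(table(carb = mtcars$carb))
carb_df$share <- round(100 * carb_df$Freq / sum(carb_df$Freq))
carb_df

# A single stacked bar, wrapped around the y-axis to form a pie
pie_gg <- ggplot(carb_df, aes(x = "", y = Freq, fill = carb)) +
  geom_col(width = 1, color = "white") +
  coord_polar(theta = "y") +
  labs(title = "Share of Cars by Number of Carburetors", fill = "Carburetors") +
  theme_void()
pie_gg
```

Because the shares are rounded individually, they need not sum to exactly 100.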
Summary: The chart above shows that the majority of the cars (about 62%) have 2 or 4 carburetors. The next most common value is 1 carburetor (22%); 3, 6, and 8 carburetors are uncommon in this dataset.

Boxplot

A boxplot shows the distribution of a quantitative response variable across the levels of a categorical explanatory variable. The example below displays the distribution of gas mileage by number of cylinders.
bx <- ggplot(data = mtcars, aes(x = factor(cyl), y = mpg )) +
geom_boxplot(fill = "blue") +
ggtitle("Distribution of Gas Mileage") +
ylab("MPG") +
xlab("Cylinders")
bx
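To read the boxplot numerically, the quartiles that define each box can be computed directly. This is a sketch using base R's `quantile()` with its default method; the name `box_stats` is chosen for illustration.

```r
# Lower quartile, median, and upper quartile of mpg within each cylinder group
box_stats <- sapply(split(mtcars$mpg, mtcars$cyl),
                    quantile, probs = c(0.25, 0.5, 0.75))
box_stats
```

The "50%" row holds the medians, which correspond to the line inside each box. The medians fall as the number of cylinders rises, matching the downward shift of the boxes in the plot.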
Regression plot: fitting a linear model with lm() and overlaying the fitted line on a scatter plot
# Use lm() to calculate a linear model and save it as carModel
carModel <- lm(mpg ~ wt, data = mtcars)
# Basic plot
mtcars$cyl <- as.factor(mtcars$cyl)
plot(mtcars$wt, mtcars$mpg, col = mtcars$cyl)
# Call abline() with carModel as first argument and set lty to 2
abline(carModel, lty = 2)
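A ggplot2 version of the regression plot can be sketched with `geom_smooth(method = "lm")`, which fits the same kind of model internally. The second plot and the `split()`/`lapply()` step fit one line per cylinder group; the object names are assumptions for illustration.

```r
library(ggplot2)

# Single model on all cars, as in the base R example
fit <- lm(mpg ~ wt, data = mtcars)
coef(fit)

# Scatter plot with one dashed regression line for the whole dataset
p_all <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point(aes(colour = factor(cyl))) +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE,
              linetype = 2, colour = "black")
p_all

# Mapping colour in the top-level aes() groups the data, giving one line per cyl value
p_by_cyl <- ggplot(mtcars, aes(x = wt, y = mpg, colour = factor(cyl))) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE)
p_by_cyl

# The per-group coefficients, fitted explicitly with lm() on each subset
fits_by_cyl <- lapply(split(mtcars, mtcars$cyl),
                      function(d) coef(lm(mpg ~ wt, data = d)))
fits_by_cyl
```

The slope of the overall model is negative: heavier cars tend to have lower gas mileage.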
# Jittered mpg values by number of gears, with one facet per gear value
ggplot(mtcars, aes(x = as.factor(gear), y = mpg, col = as.factor(gear))) +
geom_jitter() +
facet_grid(. ~ gear)