## Graphical Data Analysis (GDA)

Understanding the philosophy of ggplot2: bar plot, pie chart, histogram, boxplot, scatter plot, regression plots

ggplot2 is a free, open-source, and easy-to-use visualization package for R, written by Hadley Wickham. It implements the Grammar of Graphics, in which a plot is built up from independent layers, and it is one of the most widely used plotting packages in R.

Building blocks (layers) of the grammar of graphics:

- Data: the data set itself.
- Aesthetics: the visual attributes the data is mapped onto, such as x-axis, y-axis, color, fill, size, labels, alpha, shape, line width, and line type.
- Geometries: how the data is displayed, for example as points, lines, histograms, bars, or boxplots.
- Facets: display subsets of the data in columns and rows of panels.
- Statistics: binning, smoothing, descriptive, intermediate.
- Coordinates: the space in which the data is displayed, for example Cartesian, fixed, polar, and axis limits.
- Themes: non-data ink.
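As a rough sketch of how these building blocks combine, the following assembles a single plot with each layer marked in a comment. The choice of variables (wt, mpg, am), the y-axis limits, and theme_minimal() are illustrative assumptions rather than anything the text prescribes.

```r
library(ggplot2)

# Each "+" adds one building block of the grammar
p <- ggplot(data = mtcars,                     # data
            aes(x = wt, y = mpg)) +            # aesthetics
  geom_point() +                               # geometry
  stat_smooth(method = "lm", formula = y ~ x,
              se = FALSE) +                    # statistics: fitted straight line
  facet_grid(. ~ am) +                         # facets: one panel per am value
  coord_cartesian(ylim = c(10, 35)) +          # coordinates: visible y range
  theme_minimal()                              # theme: non-data ink
p
```

Removing any single line after the first leaves a valid plot, which is the practical meaning of "layers" here.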

Dataset used: mtcars (Motor Trend Car Road Tests) comprises fuel consumption and 10 aspects of automobile design and performance for 32 automobiles. It ships with base R in the datasets package, so no additional package is needed to load it.

options(repos = list(CRAN="http://cran.rstudio.com/"))
# mtcars ships with base R (datasets package), so nothing has to be installed to use it
# Optional: install and load dplyr for data manipulation
#install.packages("dplyr")
#library(dplyr)
# Summary of the dataset
summary(mtcars)
##       mpg             cyl             disp             hp       
##  Min.   :10.40   Min.   :4.000   Min.   : 71.1   Min.   : 52.0  
##  1st Qu.:15.43   1st Qu.:4.000   1st Qu.:120.8   1st Qu.: 96.5  
##  Median :19.20   Median :6.000   Median :196.3   Median :123.0  
##  Mean   :20.09   Mean   :6.188   Mean   :230.7   Mean   :146.7  
##  3rd Qu.:22.80   3rd Qu.:8.000   3rd Qu.:326.0   3rd Qu.:180.0  
##  Max.   :33.90   Max.   :8.000   Max.   :472.0   Max.   :335.0  
##       drat             wt             qsec             vs        
##  Min.   :2.760   Min.   :1.513   Min.   :14.50   Min.   :0.0000  
##  1st Qu.:3.080   1st Qu.:2.581   1st Qu.:16.89   1st Qu.:0.0000  
##  Median :3.695   Median :3.325   Median :17.71   Median :0.0000  
##  Mean   :3.597   Mean   :3.217   Mean   :17.85   Mean   :0.4375  
##  3rd Qu.:3.920   3rd Qu.:3.610   3rd Qu.:18.90   3rd Qu.:1.0000  
##  Max.   :4.930   Max.   :5.424   Max.   :22.90   Max.   :1.0000  
##        am              gear            carb      
##  Min.   :0.0000   Min.   :3.000   Min.   :1.000  
##  1st Qu.:0.0000   1st Qu.:3.000   1st Qu.:2.000  
##  Median :0.0000   Median :4.000   Median :2.000  
##  Mean   :0.4062   Mean   :3.688   Mean   :2.812  
##  3rd Qu.:1.0000   3rd Qu.:4.000   3rd Qu.:4.000  
##  Max.   :1.0000   Max.   :5.000   Max.   :8.000
# Explore the mtcars data frame with str()
str(mtcars)
## 'data.frame':    32 obs. of  11 variables:
##  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
##  $ disp: num  160 160 108 258 360 ...
##  $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec: num  16.5 17 18.6 19.4 17 ...
##  $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
##  $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
##  $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
##  $ carb: num  4 4 1 1 2 1 4 2 2 4 ...
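The str() output shows every column stored as numeric, although several of them are really codes with only a handful of values. A small sketch for spotting those columns follows; the cutoff of 6 distinct values is an arbitrary assumption chosen for illustration.

```r
# Number of distinct values in each column of mtcars
n_distinct_values <- sapply(mtcars, function(column) length(unique(column)))
n_distinct_values

# Columns with only a few distinct values are candidates for factor()
# (the threshold of 6 is an illustrative assumption)
candidates <- names(n_distinct_values[n_distinct_values <= 6])
candidates
```

Columns flagged this way are the ones that are usually wrapped in factor() before being mapped to color, shape, or facets.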

Data layer: the data layer specifies the source of the information to be visualized, here the mtcars data set. With only the data layer, ggplot() draws an empty panel.

library(ggplot2)
#Data Layer
ggplot(data = mtcars) + labs(title ="MTCars Data Plot")

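Aesthetic layer: maps variables of the data set onto aesthetics, here hp to the x-axis, mpg to the y-axis, and disp to color. The axes appear, but no data is drawn until a geometry is added.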

# Aesthetic Layer
ggplot(data = mtcars, aes(x = hp, y = mpg, col = disp))+labs(title = "MTCars Data Plot")

Geometric layer: the geometric layer controls how the data is displayed, using points, lines, histograms, bars, or boxplots.

# Geometric layer
ggplot(data = mtcars, aes(x = hp, y = mpg, col = disp)) +
  geom_point() +
  labs(title = "Miles per Gallon vs Horsepower", x = "Horsepower", y = "Miles per Gallon")
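One point the plots above rely on is the difference between mapping an aesthetic to a variable and setting it to a fixed value. The sketch below contrasts the two; the object names mapped and fixed are hypothetical, and layer_data() is used only to inspect what ggplot2 computed.

```r
library(ggplot2)

# Mapped: colour varies with the data, so it goes inside aes()
mapped <- ggplot(mtcars, aes(x = hp, y = mpg, colour = disp)) +
  geom_point()

# Set: one fixed colour for every point, so it goes outside aes()
fixed <- ggplot(mtcars, aes(x = hp, y = mpg)) +
  geom_point(colour = "blue")

# layer_data() returns the values ggplot2 computed for a layer
length(unique(layer_data(mapped)$colour))  # more than one colour
unique(layer_data(fixed)$colour)           # a single colour
```

Only the mapped version produces a legend, because only there does the colour carry information about the data.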

Adding size, color, and shape, and then plotting a histogram

# Adding size
ggplot(data = mtcars, aes(x = hp, y = mpg, size = disp)) +
  geom_point() +
  labs(title = "Miles per Gallon vs Horsepower", x = "Horsepower", y = "Miles per Gallon")

# Adding shape and color
ggplot(data = mtcars, aes(x = hp, y = mpg, col = factor(cyl), shape = factor(am))) +
  geom_point() +
  labs(title = "Miles per Gallon vs Horsepower", x = "Horsepower", y = "Miles per Gallon")

Treating cyl (the number of cylinders) as categorical

# A scatter plot
# Convert cyl to a factor once; later plots rely on this conversion
mtcars$cyl <- factor(mtcars$cyl)
ggplot(mtcars, aes(x = cyl, y = mpg)) +
  geom_point()

# Histogram plot
ggplot(data = mtcars, aes(x = hp)) +
  geom_histogram(binwidth = 5, color = "black", fill = "lightblue") +
  labs(title = "Histogram of Horsepower", x = "Horsepower", y = "Count")
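The histogram is an example of the statistics layer at work: geom_histogram() bins hp before anything is drawn. A minimal sketch for inspecting those bins is given below, assuming the same binwidth of 5; the names hist_plot and bins are hypothetical.

```r
library(ggplot2)

hist_plot <- ggplot(mtcars, aes(x = hp)) +
  geom_histogram(binwidth = 5, color = "black", fill = "lightblue")

# layer_data() returns the binned data: one row per bin with its limits and count
bins <- layer_data(hist_plot, 1)
head(bins[, c("xmin", "xmax", "count")])

# Every car falls into exactly one bin, so the counts add up to the number of rows
sum(bins$count) == nrow(mtcars)
```

Changing binwidth changes the rows of this table, not the underlying data, which is why the same data can look quite different at different bin widths.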

Functions used for each plot type:

- Bar plot: geom_bar()
- Pie chart: pie()
- Boxplot: geom_boxplot()
- Scatter plot: geom_point()

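# Bar plot: number of cars for each cylinder count
# (fill = cyl gives discrete colors because cyl was converted to a factor above)
ggplot(data = mtcars, aes(x = as.factor(cyl), fill = cyl)) +
  geom_bar(stat = "count")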

Pie chart fill colors:

- scale_fill_manual(): to use custom colors
- scale_fill_brewer(): to use color palettes from the RColorBrewer package
- scale_fill_grey(): to use grey color palettes

Create a bar graph that shows the number of cars of each gear type in mtcars, and write a brief summary.

# Count the cars for each number of gears
gear.type = table(mtcars$gear)
# The bars are labelled on the x-axis, so no legend is needed
barplot(gear.type, main = "Number of Cars by Gear Count", xlab = "No. of Gears",
        ylab = "Frequency of Cars", names.arg = names(gear.type),
        col = c("blue", "green", "yellow"))

## 3-gear and 4-gear cars are more common than 5-gear cars: there are 15, 12, and 5 of them, respectively.
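The same bar graph can be sketched in ggplot2, which also gives a chance to use scale_fill_manual() from the list above. This is one possible version, not the only one; the object names gear_counts and gear_bar are hypothetical, and the colors simply mirror the base R call.

```r
library(ggplot2)

# Counts per gear value; these are the bar heights
gear_counts <- table(mtcars$gear)
gear_counts

# geom_bar() counts the rows for each x value by default
gear_bar <- ggplot(mtcars, aes(x = factor(gear), fill = factor(gear))) +
  geom_bar() +
  scale_fill_manual(values = c("blue", "green", "yellow")) +
  labs(title = "Number of Cars by Gear Count",
       x = "No. of Gears", y = "Frequency of Cars", fill = "Gears")
gear_bar
```

Printing gear_counts alongside the plot makes it easy to check the written summary against the actual numbers.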

Pie chart

A pie chart showing the proportion of cars in the mtcars data set for each carb value, followed by a brief summary.

carb = table(mtcars$carb)
data.labels = names(carb)
share = round(carb/sum(carb)*100)
data.labels = paste(data.labels, share)
data.labels = paste(data.labels,"%",sep="") 
pie(carb,labels = data.labels,clockwise=TRUE, col=heat.colors(length(data.labels)), main="Frequency of Carb value")
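ggplot2 has no dedicated pie geometry, so a common approach is to draw a single stacked bar and switch to polar coordinates. The sketch below is one way to reproduce the chart above under that approach; the names carb_counts and pie_gg and the "Set2" palette are assumptions made for illustration.

```r
library(ggplot2)

# One row per carb value with its frequency
carb_counts <- as.data.frame(table(carb = mtcars$carb))

# A single stacked bar, bent into a circle by coord_polar()
pie_gg <- ggplot(carb_counts, aes(x = "", y = Freq, fill = carb)) +
  geom_col(width = 1) +
  coord_polar(theta = "y") +
  scale_fill_brewer(palette = "Set2") +
  labs(title = "Frequency of Carb value", fill = "Carb") +
  theme_void()
pie_gg
```

This also illustrates the coordinates layer: the data and geometry are those of a bar chart, and only the coordinate system turns it into a pie.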

## The chart above shows that the majority of the cars (62%) have 2 or 4 carbs. The next most common value is 1 carb (22%); 3, 6, and 8 carbs are uncommon in this data set.

Boxplot

A boxplot shows the distribution of a quantitative response variable across the levels of a categorical explanatory variable. The example below displays the distribution of gas mileage by number of cylinders.

bx <- ggplot(data = mtcars, aes(x = factor(cyl), y = mpg )) + 
  geom_boxplot(fill = "blue") + 
  ggtitle("Distribution of Gas Mileage") +
  ylab("MPG") + 
  xlab("Cylinders") 
bx
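To read the boxplot numerically, the quartiles behind each box can be computed directly. The following is a small sketch using base R; the names mpg_quartiles and mpg_medians are hypothetical. Each box spans the 25% to 75% rows and the line inside it sits at the 50% row, while the whiskers may stop short of the 0% and 100% values when there are outliers.

```r
# Quartiles of mpg for each cylinder count (one column per group)
mpg_quartiles <- sapply(split(mtcars$mpg, mtcars$cyl), quantile)
mpg_quartiles

# The median is the "50%" row: the line drawn inside each box
mpg_medians <- mpg_quartiles["50%", ]
mpg_medians
```

In this data set the median mileage falls as the number of cylinders rises, which is the pattern the boxplot shows visually.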

Regression plot: fitting a linear model with lm() and drawing the fitted line over a scatter plot

# Use lm() to calculate a linear model and save it as carModel
carModel <- lm(mpg ~ wt, data = mtcars)

# Basic plot
mtcars$cyl <- as.factor(mtcars$cyl)
plot(mtcars$wt, mtcars$mpg, col = mtcars$cyl)

# Call abline() with carModel as first argument and set lty to 2
abline(carModel, lty = 2)
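A ggplot2 counterpart of the base plot above can be sketched with geom_smooth(method = "lm"), which fits the same straight line as lm(mpg ~ wt). The object name reg_plot and the styling choices are assumptions for illustration.

```r
library(ggplot2)

# Same model as above; coef() shows the intercept and slope of the line
carModel <- lm(mpg ~ wt, data = mtcars)
coef(carModel)

# Points colored by cylinder count, with one dashed least-squares line on top
reg_plot <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point(aes(colour = factor(cyl))) +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE,
              linetype = "dashed", colour = "black")
reg_plot
```

Moving colour = factor(cyl) into the top-level aes() would instead fit one line per cylinder group, which is how a model for each subset of the data can be shown.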

Facets: facet_grid() splits the plot into one panel per gear value.

ggplot(mtcars, aes(x = as.factor(gear), y = mpg, col = gear)) +
  geom_jitter() +
  facet_grid(. ~ gear)