Visualization is one of the most important parts of any data analysis. Without proper visualization, our data might fail to tell the entire story. And if you are a newbie in data science or analytics, you will sooner or later run into doubts about which plot to use for which kind of variable.
In most data analyses we deal with a mix of categorical and numeric variables from the source data. dplyr and ggplot2 are the R packages we will use for data manipulation and graphics respectively, and we will use the dataset “mpg”, which comes with the ggplot2 package.
“mpg” contains fuel economy data from 1999 and 2008 for 38 popular models of cars. First we will load the required packages and the dataset into our R workspace (RStudio).
library(ggplot2)
library(dplyr)
data(mpg)
To get a quick overview of our (or for that matter any) dataset, use str() or glimpse():
str(mpg)
## Classes 'tbl_df', 'tbl' and 'data.frame': 234 obs. of 11 variables:
## $ manufacturer: chr "audi" "audi" "audi" "audi" ...
## $ model : chr "a4" "a4" "a4" "a4" ...
## $ displ : num 1.8 1.8 2 2 2.8 2.8 3.1 1.8 1.8 2 ...
## $ year : int 1999 1999 2008 2008 1999 1999 2008 1999 1999 2008 ...
## $ cyl : int 4 4 4 4 6 6 6 4 4 4 ...
## $ trans : chr "auto(l5)" "manual(m5)" "manual(m6)" "auto(av)" ...
## $ drv : chr "f" "f" "f" "f" ...
## $ cty : int 18 21 20 21 16 18 18 18 16 20 ...
## $ hwy : int 29 29 31 30 26 26 27 26 25 28 ...
## $ fl : chr "p" "p" "p" "p" ...
## $ class : chr "compact" "compact" "compact" "compact" ...
glimpse(mpg)
## Observations: 234
## Variables: 11
## $ manufacturer (chr) "audi", "audi", "audi", "audi", "audi", "audi", "...
## $ model (chr) "a4", "a4", "a4", "a4", "a4", "a4", "a4", "a4 qua...
## $ displ (dbl) 1.8, 1.8, 2.0, 2.0, 2.8, 2.8, 3.1, 1.8, 1.8, 2.0,...
## $ year (int) 1999, 1999, 2008, 2008, 1999, 1999, 2008, 1999, 1...
## $ cyl (int) 4, 4, 4, 4, 6, 6, 6, 4, 4, 4, 4, 6, 6, 6, 6, 6, 6...
## $ trans (chr) "auto(l5)", "manual(m5)", "manual(m6)", "auto(av)...
## $ drv (chr) "f", "f", "f", "f", "f", "f", "f", "4", "4", "4",...
## $ cty (int) 18, 21, 20, 21, 16, 18, 18, 18, 16, 20, 19, 15, 1...
## $ hwy (int) 29, 29, 31, 30, 26, 26, 27, 26, 25, 28, 27, 25, 2...
## $ fl (chr) "p", "p", "p", "p", "p", "p", "p", "p", "p", "p",...
## $ class (chr) "compact", "compact", "compact", "compact", "comp...
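Since the choice of plot depends on whether a variable is categorical or numeric, it can help to tabulate the column types before plotting. The following is only a small sketch; the name `col_types` is introduced here for illustration and is not part of ggplot2 or dplyr.

```r
library(ggplot2)  # provides the mpg dataset

# First class of each column: "character" columns are the categorical
# variables here, "integer" and "numeric" columns are the numeric ones.
col_types <- vapply(mpg, function(x) class(x)[1], character(1))
col_types

# Number of columns of each type
table(col_types)
```

Note that a column stored as a number (for example cyl or year) can still behave like a category in a plot, so the type is a starting point rather than a rule.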
Remember that the following examples are intended to explain the general principles. You might need to adapt them to fit any additional requirements.
To look at the distribution of a single numeric variable, use a dot plot or a histogram.
## Dot plot
ggplot(mpg, aes(cty)) +
geom_dotplot()
OR
## Histogram
ggplot(mpg, aes(cty)) +
geom_histogram(binwidth = 2)
To see the relationship between two numeric variables, use a scatterplot.
ggplot(mpg, aes(cty,hwy)) +
geom_point()
To show the counts of a single categorical variable, use a bar graph.
ggplot(mpg, aes(drv)) +
geom_bar()
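When no y variable is given, geom_bar() counts the rows in each category and uses those counts as bar heights. As a rough check, the same numbers can be computed with dplyr; `drv_counts` is just an illustrative name.

```r
library(ggplot2)  # provides the mpg dataset
library(dplyr)

# Number of cars per drive type; these counts are the bar heights
# that geom_bar() draws for aes(drv).
drv_counts <- mpg %>% count(drv)
drv_counts
```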
To compare a numeric variable across the levels of a categorical variable, summarise the numeric variable per category and plot the summary as a bar graph.
mpg %>% group_by(manufacturer) %>% summarise(avg_cty_mileage = mean(cty)) %>%
ggplot(aes(x = manufacturer, y = avg_cty_mileage)) +
geom_bar(stat = "identity")
## To flip the coordinates (horizontal bars)
mpg %>% group_by(manufacturer) %>% summarise(avg_cty_mileage = mean(cty)) %>%
ggplot(aes(x = manufacturer, y = avg_cty_mileage)) +
geom_bar(stat = "identity") +
coord_flip()
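A summary bar graph is often easier to read when the bars are sorted by value. One possible way to do that, shown here only as a sketch, is to wrap the category in base R's reorder(). It also uses geom_col(), which is ggplot2's shorthand for geom_bar(stat = "identity"); the name `avg_mileage` is made up for this example.

```r
library(ggplot2)
library(dplyr)

# Average city mileage per manufacturer (same summary as above)
avg_mileage <- mpg %>%
  group_by(manufacturer) %>%
  summarise(avg_cty_mileage = mean(cty))

# reorder() sorts the manufacturers by their average mileage,
# so the bars appear in increasing order instead of alphabetically.
p_sorted <- ggplot(avg_mileage,
                   aes(x = reorder(manufacturer, avg_cty_mileage),
                       y = avg_cty_mileage)) +
  geom_col() +
  coord_flip() +
  xlab("manufacturer")
p_sorted
```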
To plot two categorical variables together, use a bar graph with the second variable mapped to fill.
ggplot(mpg, aes(class, fill = drv)) +
geom_bar()
OR
ggplot(mpg, aes(class, fill = drv)) +
geom_bar(position = "stack")
## To normalize the height
ggplot(mpg, aes(class, fill = drv)) +
geom_bar(position = "fill")
## Side by side
ggplot(mpg, aes(class, fill = drv)) +
geom_bar(position = "dodge")
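To make clearer what position = "fill" does, the shares it displays can be computed by hand: count the cars per class and drive type, then divide by the class total. This is only an illustrative sketch; `drv_share` and `share` are names invented here.

```r
library(ggplot2)  # provides the mpg dataset
library(dplyr)

# Count cars per class/drv combination, then convert the counts
# into within-class proportions (what position = "fill" displays).
drv_share <- mpg %>%
  count(class, drv) %>%
  group_by(class) %>%
  mutate(share = n / sum(n)) %>%
  ungroup()
drv_share
```

Within each class the shares add up to 1, which is why every bar has the same height in the normalized plot.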
To plot three categorical variables, use faceting with bar graphs.
ggplot(mpg, aes(drv, fill = class)) + geom_bar() +
facet_grid(~fl , labeller = label_parsed)
OR
ggplot(mpg, aes(drv, fill = class)) + geom_bar() +
facet_wrap(~fl , ncol = 2)
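With faceting, all panels share the same y-axis by default, so panels with few rows can be hard to read. A quick count per facet level shows how balanced the panels are, and the scales argument of facet_wrap() can give each panel its own axis. This is a sketch of one option, not a recommendation for every case; `fl_counts` is an illustrative name.

```r
library(ggplot2)
library(dplyr)

# Rows per fuel type: a quick check of how balanced the panels are
fl_counts <- mpg %>% count(fl)
fl_counts

# Same faceted bar graph, but each panel gets its own y-axis
p_free <- ggplot(mpg, aes(drv, fill = class)) +
  geom_bar() +
  facet_wrap(~fl, ncol = 2, scales = "free_y")
p_free
```

Free scales make small panels readable, but bar heights can then no longer be compared directly across panels.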
To compare two numeric variables across a categorical variable, use faceting with a scatterplot.
ggplot(mpg, aes(cty, hwy)) + geom_point() +
facet_grid(year~.)
To add a title and axis labels, use ggtitle(), xlab() and ylab().
ggplot(mpg, aes(cty, hwy)) +
geom_point() +
ggtitle("Mileage comparison") +
xlab("city") +
ylab("highway")
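The title and both axis labels can also be set in a single call with labs(), which some people find easier to maintain. A minimal sketch; the label texts are my own wording, based on cty and hwy being city and highway miles per gallon.

```r
library(ggplot2)

# labs() sets the title and axis labels in one place
p_labelled <- ggplot(mpg, aes(cty, hwy)) +
  geom_point() +
  labs(title = "Mileage comparison",
       x = "City mileage (mpg)",
       y = "Highway mileage (mpg)")
p_labelled
```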
To deal with overplotting (many points drawn on top of each other), use alpha and jitter.
## see through points
ggplot(mpg, aes(cty,hwy)) +
geom_point(alpha = .2)
## jittering
ggplot(mpg, aes(cty, hwy)) +
geom_point(alpha = .4, position = position_jitter(width = .1, height = .1))
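Why transparency and jitter are needed here becomes visible when the points are counted: cty and hwy are whole numbers, so many cars fall on exactly the same coordinates. The sketch below counts cars per (cty, hwy) pair; `pair_counts` and `n_cars` are names chosen for this example.

```r
library(ggplot2)  # provides the mpg dataset
library(dplyr)

# Number of cars sharing each exact (cty, hwy) position
pair_counts <- mpg %>%
  count(cty, hwy, name = "n_cars") %>%
  arrange(desc(n_cars))

nrow(mpg)          # total number of cars
nrow(pair_counts)  # number of distinct positions actually drawn
head(pair_counts)  # the most crowded positions
```

If the second number is much smaller than the first, a plain scatterplot hides a large part of the data.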
To change the overall look of the plot, use a theme.
## removes the grey background.
ggplot(mpg, aes(cty,hwy)) +
geom_point(alpha = .2) +
theme_bw()
## use a minimal theme.
ggplot(mpg, aes(cty,hwy)) +
geom_point(alpha = .2) +
theme_minimal()