- Introduction to ggplot2 Package (with Scatter Plot)
- Bar Chart
- Tree Map
- Correlation Plot
ggplot2 is a powerful package for more advanced visualization.
The functions in the ggplot2 package build up a graph in layers: start with a simple graph, then add elements with +, one function at a time.
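The layering idea can be sketched in a few lines. This is a minimal illustration, not part of the original walkthrough: it assumes the built-in mtcars data frame as a stand-in dataset, and the object names (base, with_points, with_line) are hypothetical.

```r
library(ggplot2)

# mtcars ships with R; it stands in for any data frame here (an assumption)
base <- ggplot(data = mtcars, mapping = aes(x = wt, y = mpg))  # axes only, no layers

# each + adds exactly one layer to the plot object
with_points <- base + geom_point()
with_line <- with_points + geom_smooth(method = "lm", formula = y ~ x)

length(base$layers)         # 0
length(with_points$layers)  # 1
length(with_line$layers)    # 2

with_line  # printing the object draws the finished plot
```

Storing intermediate plots in objects is optional; it simply makes each layer visible as a separate step.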
We first install and load a dataset from the mosaicData package using the following code, then load the ggplot2 library.
install.packages('mosaicData')
library('mosaicData')
data(CPS85, package = "mosaicData")
# load library
library(ggplot2)
head(CPS85,5)
##   wage educ race sex hispanic south married exper union age   sector
## 1  9.0   10    W   M       NH    NS Married    27   Not  43    const
## 2  5.5   12    W   M       NH    NS Married    20   Not  38    sales
## 3  3.8   12    W   F       NH    NS  Single     4   Not  22    sales
## 4 10.5   12    W   F       NH    NS Married    29   Not  47 clerical
## 5 15.0   12    W   M       NH    NS Married    40 Union  58    const
The first function in building a graph is the ggplot function. It specifies:
- the data frame containing the data to be plotted
- the mapping of the variables to visual properties of the graph, set with aes (where aes stands for aesthetics)
# specify dataset and mapping
ggplot(data = CPS85,
mapping = aes(x = exper, y = wage)) # specify x-axis and y axis
Geoms are the geometric objects (points, lines, bars, etc.) that can be placed on a graph. They are added using functions that start with geom_. In this example, the geom_point function creates a scatterplot. In ggplot2 graphs, functions are chained together with the + sign to build the final plot.
ggplot(data = CPS85,
mapping = aes(x = exper, y = wage)) +
geom_point() #This is how we add datapoints to the plot
The graph indicates that there is an outlier. One individual has a wage much higher than the rest. We’ll delete this case before continuing.
library(dplyr)
plotdata <- filter(CPS85, wage < 40) # filter out the outlier
# redraw scatterplot
ggplot(data = plotdata, # use the filtered dataset
mapping = aes(x = exper, y = wage)) +
geom_point()
## Parameters/Options of geom_ Functions
Options for the geom_point function include:
1) Point color: color
2) Point size: size
3) Transparency: alpha, ranging from 0 (completely transparent) to 1 (completely opaque)
# make points blue, larger, and semi-transparent
ggplot(data = plotdata,
mapping = aes(x = exper, y = wage)) +
geom_point(color = "cornflowerblue", #Color
alpha = .7, #Transparency level
size = 3) #Point Size
Next, let’s add a line of best fit. We can do this with the geom_smooth function. Options control:
- the type of line (linear, quadratic, nonparametric)
- the thickness of the line
- the line’s color
- the presence or absence of a confidence interval
Here we request a linear regression line (method = "lm", where lm stands for linear model).
# add a line of best fit.
ggplot(data = plotdata,
mapping = aes(x = exper, y = wage)) +
geom_point(color = "cornflowerblue",alpha = .7,size = 3) +
geom_smooth(method = "lm") # add a smoothing layer; method = "lm" requests a linear fit
## `geom_smooth()` using formula = 'y ~ x'
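The other line types mentioned above (quadratic and nonparametric) can be requested through the same function. The following is a sketch under stated assumptions: it reuses the CPS85 data with the same outlier filter, and the object names quad_fit and loess_fit are hypothetical.

```r
library(ggplot2)
library(dplyr)

data(CPS85, package = "mosaicData")
plotdata <- dplyr::filter(CPS85, wage < 40)  # same outlier filter as above

# quadratic fit: still method = "lm", but with a second-degree polynomial formula
quad_fit <- ggplot(plotdata, aes(x = exper, y = wage)) +
  geom_point(color = "cornflowerblue", alpha = .7, size = 3) +
  geom_smooth(method = "lm", formula = y ~ poly(x, 2), color = "indianred3")

# nonparametric fit: loess smoother, thicker line, no confidence band
loess_fit <- ggplot(plotdata, aes(x = exper, y = wage)) +
  geom_point(color = "cornflowerblue", alpha = .7, size = 3) +
  geom_smooth(method = "loess", formula = y ~ x, se = FALSE, linewidth = 1.5)

quad_fit
loess_fit
```

Passing formula explicitly also silences the message about the default formula.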
In addition to the x and y axes, variables can be mapped to the color, shape, size, transparency, and other visual properties of geometric objects. This allows groups of observations to be superimposed in a single graph.
# indicate sex using color
ggplot(data = plotdata, mapping = aes(x = exper, y = wage, color = sex)) +
#using sex variable to differentiate colors
geom_point(alpha = .7, size = 3) +
geom_smooth(method = "lm", se = FALSE, linewidth = 1.5)
It appears that men tend to make more money than women. Additionally, there may be a stronger relationship between experience and wages for men than for women.
Exercise: to explore the relationship between age and income, create a graph with age on the x axis and wage on the y axis. Group the observations by marital status (set the color of the points based on the married variable).
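One possible way to draw the graph described in the exercise is sketched below. It assumes the same filtered CPS85 data as the earlier plots; the object name age_plot is hypothetical, and the alpha and size values are simply carried over from the previous examples.

```r
library(ggplot2)
library(dplyr)

data(CPS85, package = "mosaicData")
plotdata <- dplyr::filter(CPS85, wage < 40)  # drop the wage outlier, as before

# age on the x axis, wage on the y axis, point color set by marital status
age_plot <- ggplot(data = plotdata,
                   mapping = aes(x = age, y = wage, color = married)) +
  geom_point(alpha = .7, size = 3)

age_plot
```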
Scales control how variables are mapped to the visual characteristics of the plot. Scale functions (which start with scale_) allow you to modify this mapping. In the next plot, we’ll change the x and y axis scaling and the colors employed.
# modify the x and y axes and specify the colors to be used
ggplot(data = plotdata,
mapping = aes(x = exper, y = wage,color = sex)) +
geom_point(alpha = .7, size = 3) +
geom_smooth(method = "lm", se = FALSE, linewidth = 1.5) +
scale_x_continuous(breaks = seq(0, 60, 10)) + #adjust the scales of axes
scale_y_continuous(breaks = seq(0, 30, 5), labels = scales::dollar) + # format y-axis labels as dollars
scale_color_manual(values = c("indianred3", "cornflowerblue"))
# specify the colors used for each sex (points and trend lines)
Facets reproduce a graph for each level of a given variable (or combination of variables). Faceting functions start with facet_. Here, facets are defined by the eight levels of the sector variable.
ggplot(data = plotdata,
mapping = aes(x = exper, y = wage,color = sex)) +
geom_point(alpha = .7) +
geom_smooth(method = "lm", se = FALSE) +
scale_x_continuous(breaks = seq(0, 60, 10)) +
scale_y_continuous(breaks = seq(0, 30, 5), labels = scales::dollar) +
scale_color_manual(values = c("indianred3","cornflowerblue")) +
facet_wrap(~sector) #This is the facet_ function, how we produce a graph for each sector
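Faceting on a combination of variables, mentioned above, could look like the following sketch. It uses facet_grid with sex as rows and union as columns; this particular choice of variables is an assumption made for illustration, as is the object name facet_plot.

```r
library(ggplot2)
library(dplyr)

data(CPS85, package = "mosaicData")
plotdata <- dplyr::filter(CPS85, wage < 40)  # same outlier filter as above

# one panel per combination of sex (rows) and union membership (columns)
facet_plot <- ggplot(data = plotdata,
                     mapping = aes(x = exper, y = wage)) +
  geom_point(alpha = .7, color = "cornflowerblue") +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +
  facet_grid(rows = vars(sex), cols = vars(union))

facet_plot
```

facet_wrap lays panels out in a ribbon that wraps across rows, while facet_grid arranges them in a strict row-by-column matrix, which is usually easier to read when two variables are crossed.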
The labs function provides customized labels for the axes and legends. Additionally, a custom title, subtitle, and caption can be added.
# add informative labels
ggplot(data = plotdata,mapping = aes(x = exper, y = wage,color = sex)) +
geom_point(alpha = .7) +
geom_smooth(method = "lm", se = FALSE) +
scale_x_continuous(breaks = seq(0, 60, 10)) +
scale_y_continuous(breaks = seq(0, 30, 5), labels = scales::dollar) +
scale_color_manual(values = c("indianred3","cornflowerblue")) +
facet_wrap(~sector) +
labs(title = "Relationship between wages and experience",
subtitle = "Current Population Survey",
caption = "source: http://mosaic-web.org/",
x = "Years of Experience",
y = "Hourly Wage",
color = "Gender")
The distribution of a single categorical variable is typically plotted with a bar chart, a pie chart, or (less commonly) a tree map.
The Marriage dataset contains the marriage records of 98 individuals in Mobile County, Alabama. Below, a bar chart is used to display the distribution of wedding participants by race.
library(ggplot2)
data(Marriage, package = "mosaicData")
ggplot(Marriage, aes(x = race)) +
geom_bar(fill = "cornflowerblue",
color="black") +
labs(x = "Race",
y = "Frequency",
title = "Participants by race")
library(ggplot2)
library(tidyverse)
df <- read_csv("HousePrices.csv")
# count the houses and average the price within each air-conditioning group
ImageData <- df %>%
  group_by(air_cond) %>%
  summarise(n_house = n(), avg_price = mean(price))
ImageData
## # A tibble: 2 × 3
##   air_cond n_house avg_price
##   <chr>      <int>     <dbl>
## 1 No          1093   187022.
## 2 Yes          635   254904.
ggplot(ImageData, aes(air_cond, avg_price)) +
geom_col(colour = "cornflowerblue",fill = "cornflowerblue") +
labs(x = "Whether house has AC", y = "Average Price",
title = "Price by AC") +
coord_cartesian(ylim = c(150000, 270000)) # set y limits
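The two bar charts above rely on different geoms: geom_bar counts the rows in each category itself, while geom_col draws bar heights from a column that has already been computed. The sketch below checks that the two agree on the same data. It assumes the built-in mtcars data frame as a stand-in, and the object names are hypothetical.

```r
library(ggplot2)

# geom_bar: supply only x; bar heights are row counts per category
bar_plot <- ggplot(mtcars, aes(x = factor(cyl))) +
  geom_bar(fill = "cornflowerblue")

# geom_col: compute the counts first, then map them to y explicitly
cyl_counts <- as.data.frame(table(cyl = mtcars$cyl))  # columns: cyl, Freq
col_plot <- ggplot(cyl_counts, aes(x = cyl, y = Freq)) +
  geom_col(fill = "cornflowerblue")

# layer_data() returns the values ggplot2 actually draws for a layer
bar_heights <- as.numeric(layer_data(bar_plot)$count)
col_heights <- as.numeric(layer_data(col_plot)$y)

identical(bar_heights, col_heights)  # TRUE
```

geom_col is the natural choice whenever the bar height is a summary other than a count, such as the average price per group.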
A tree map is an alternative to a pie chart, and it can also handle categorical variables with many levels.
#install.packages("treemapify")
library(treemapify)
data(Marriage, package = "mosaicData")
# create a treemap of marriage officials
plotdata <- Marriage %>%
count(officialTitle)
ggplot(plotdata, aes(fill = officialTitle,
area = n)) +
geom_treemap() +
labs(title = "Marriages by officiate")
We can also label each area with the category it represents.
ggplot(plotdata,
aes(fill = officialTitle, area = n, label = officialTitle)) +
geom_treemap() +
geom_treemap_text(colour = "white",place = "centre") +
labs(title = "Marriages by officiate")
Correlation plots help you visualize the pairwise relationships between a set of quantitative variables by displaying their correlations using color or shading. We first calculate the correlations.
# load data
data(SaratogaHouses, package = "mosaicData")
# select numeric variables
df <- dplyr::select_if(SaratogaHouses, is.numeric)
# calculate the correlations
r <- cor(df, use = "complete.obs")
round(r, 2)
##            price lotSize   age landValue livingArea pctCollege bedrooms
## price       1.00    0.16 -0.19      0.58       0.71       0.20     0.40
## lotSize     0.16    1.00 -0.02      0.06       0.16      -0.03     0.11
## age        -0.19   -0.02  1.00     -0.02      -0.17      -0.04     0.03
## landValue   0.58    0.06 -0.02      1.00       0.42       0.23     0.20
## livingArea  0.71    0.16 -0.17      0.42       1.00       0.21     0.66
## pctCollege  0.20   -0.03 -0.04      0.23       0.21       1.00     0.16
## bedrooms    0.40    0.11  0.03      0.20       0.66       0.16     1.00
## fireplaces  0.38    0.09 -0.17      0.21       0.47       0.25     0.28
## bathrooms   0.60    0.08 -0.36      0.30       0.72       0.18     0.46
## rooms       0.53    0.14 -0.08      0.30       0.73       0.16     0.67
##            fireplaces bathrooms rooms
## price            0.38      0.60  0.53
## lotSize          0.09      0.08  0.14
## age             -0.17     -0.36 -0.08
## landValue        0.21      0.30  0.30
## livingArea       0.47      0.72  0.73
## pctCollege       0.25      0.18  0.16
## bedrooms         0.28      0.46  0.67
## fireplaces       1.00      0.44  0.32
## bathrooms        0.44      1.00  0.52
## rooms            0.32      0.52  1.00
library(ggplot2)
library(ggcorrplot)
ggcorrplot(r, colors = c("blue", "white", "red")) # Plot the correlation
We can also customize the output.
- hc.order = TRUE reorders the variables, placing variables with similar correlation patterns together.
- type = "lower" plots the lower portion of the correlation matrix.
- lab = TRUE overlays the correlation coefficients (as text) on the plot.
ggcorrplot(r,
hc.order = TRUE,
type = "lower",
lab = TRUE)
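Plotting only the lower portion loses no information, because a correlation matrix is symmetric with ones on the diagonal. The sketch below checks this in base R. It assumes four columns of the built-in mtcars data as a stand-in for the housing variables, and the object names are hypothetical.

```r
# a small correlation matrix from built-in data (an assumption, for illustration)
demo_vars <- mtcars[, c("mpg", "wt", "hp", "disp")]
r_demo <- cor(demo_vars, use = "complete.obs")

# cor(x, y) equals cor(y, x), so the matrix mirrors itself across the diagonal
symmetric <- isSymmetric(r_demo)

# every variable correlates perfectly with itself
unit_diagonal <- isTRUE(all.equal(unname(diag(r_demo)), rep(1, ncol(demo_vars))))

# blank out the redundant upper triangle, leaving the part type = "lower" keeps
lower_only <- r_demo
lower_only[upper.tri(lower_only)] <- NA
round(lower_only, 2)
```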