- Introduction to ggplot2 Package (with Scatter Plot)
- Bar Chart
- Pie Chart
- Tree Map
- Correlation Plot
ggplot2 is a powerful package for more advanced visualization.
The functions in the ggplot2 package build up a graph in layers: start with a simple graph, then add further elements with the `+` operator, one at a time.
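To make the layering idea concrete, here is a minimal sketch. It assumes the built-in `mtcars` data as a stand-in (it is not the dataset used in this tutorial), and the object names `p`, `p_points`, and `p_full` are purely illustrative. Each `+` returns a new plot object, and nothing is drawn until a plot is printed.

```r
library(ggplot2)

# Start with data and aesthetic mapping only: this gives an empty canvas
# (mtcars ships with R and is used here purely as a stand-in dataset)
p <- ggplot(data = mtcars, mapping = aes(x = wt, y = mpg))

# Add one layer at a time with `+`; each step returns a new plot object
p_points <- p + geom_point()
p_full   <- p_points + geom_smooth(method = "lm", formula = y ~ x)

# Printing the object is what actually draws the plot
print(p_full)
```

Storing intermediate plots like this is optional, but it can make it easier to see what each layer contributes.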
We first load a dataset from the mosaicData package using the following code, then load the ggplot2 library.
    # install.packages("mosaicData")
    data(CPS85, package = "mosaicData")

    # load library
    library(ggplot2)
    head(CPS85, 5)
    ##   wage educ race sex hispanic south married exper union age   sector
    ## 1  9.0   10    W   M       NH    NS Married    27   Not  43    const
    ## 2  5.5   12    W   M       NH    NS Married    20   Not  38    sales
    ## 3  3.8   12    W   F       NH    NS  Single     4   Not  22    sales
    ## 4 10.5   12    W   F       NH    NS Married    29   Not  47 clerical
    ## 5 15.0   12    W   M       NH    NS Married    40 Union  58    const
The first function in building a graph is the `ggplot` function. It specifies:
- the data frame containing the data to be plotted
- the mapping of the variables to visual properties of the graph, given with `aes` (where aes stands for aesthetics)
    # specify dataset and mapping
    ggplot(data = CPS85,
           mapping = aes(x = exper, y = wage)) # "exper" goes on the x axis and "wage" on the y axis
Geoms are the geometric objects (points, lines, bars, etc.) of a graph. They are added using functions that start with `geom_`. In this example, the `geom_point` function is used to create a scatterplot. In ggplot2 graphs, functions are chained together using the `+` sign to build a final plot.
    ggplot(data = CPS85,
           mapping = aes(x = exper, y = wage)) +
      geom_point() # this layer adds the data points to the plot
The graph indicates that there is an outlier. One individual has a wage much higher than the rest. We’ll delete this case before continuing.
    library(dplyr)
    plotdata <- filter(CPS85, wage < 40) # filter out the outlier

    # redraw scatterplot using the updated dataset
    ggplot(data = plotdata,
           mapping = aes(x = exper, y = wage)) +
      geom_point()
## Parameter/Options of geom_ Function
Options for the `geom_point` function include:
1) Point color: `color`
2) Point size: `size`
3) Transparency: `alpha`, ranging from 0 (completely transparent) to 1 (completely opaque)
    # make points blue, larger, and semi-transparent
    ggplot(data = plotdata,
           mapping = aes(x = exper, y = wage)) +
      geom_point(color = "cornflowerblue", # options of the geom_ function are set here
                 alpha = .7,
                 size = 3)
Next, let’s add a line of best fit. We can do this with the `geom_smooth` function. Options control:
- the type of line (linear, quadratic, nonparametric)
- the thickness of the line
- the line’s color
- the presence or absence of a confidence interval

Here we request a linear regression line with `method = "lm"` (where lm stands for linear model).
    # add a line of best fit
    ggplot(data = plotdata,
           mapping = aes(x = exper, y = wage)) +
      geom_point(color = "cornflowerblue", alpha = .7, size = 3) +
      geom_smooth(method = "lm") # another layer: a line of best fit from a linear regression

    ## `geom_smooth()` using formula = 'y ~ x'
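The other `geom_smooth` options mentioned above (a quadratic or nonparametric line, thickness, color, and the confidence band) can be sketched as follows. This is an illustrative example only: it assumes the built-in `mtcars` data as a stand-in dataset, and the object names are hypothetical.

```r
library(ggplot2)

# mtcars (built into R) is an assumed stand-in dataset for illustration
base <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point(color = "cornflowerblue", alpha = .7, size = 3)

# Quadratic fit: still fitted with lm, but using a second-degree polynomial in x;
# color and linewidth control the line's color and thickness
p_quad <- base +
  geom_smooth(method = "lm", formula = y ~ poly(x, 2),
              color = "indianred3", linewidth = 1.5)

# Nonparametric fit: a loess smoother, with the confidence band switched off
p_loess <- base +
  geom_smooth(method = "loess", formula = y ~ x, se = FALSE)

print(p_quad)
print(p_loess)
```

Passing `formula` explicitly also avoids the informational message that `geom_smooth()` prints when it has to pick a formula itself.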
Variables can also be mapped to the color, shape, size, transparency, and other visual characteristics of geometric objects. This allows groups of observations to be superimposed in a single graph.
    # indicate sex using color
    ggplot(data = plotdata,
           mapping = aes(x = exper, y = wage,
                         color = sex)) + # the sex variable determines the color
      geom_point(alpha = .7, size = 3) +
      geom_smooth(method = "lm", se = FALSE, linewidth = 1.5)
It appears that men tend to make more money than women. Additionally, there may be a stronger relationship between experience and wages for men than for women.
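A point that is easy to miss is the difference between setting an aesthetic to a constant (outside `aes`) and mapping it to a variable (inside `aes`). The sketch below contrasts the two; it assumes the built-in `mtcars` data as a stand-in, and the object names are illustrative.

```r
library(ggplot2)

# Setting: a constant color outside aes() applies to every point (no legend)
p_set <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point(color = "cornflowerblue", size = 3)

# Mapping: color inside aes() varies with a variable and produces a legend
# (factor() turns the numeric cyl column into a categorical grouping variable)
p_map <- ggplot(mtcars, aes(x = wt, y = mpg, color = factor(cyl))) +
  geom_point(size = 3)

print(p_set)
print(p_map)
```

In the first plot every point shares one color; in the second, ggplot2 picks one color per group and adds the legend automatically.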
Exercise: to explore the relationship between age and income, create a scatterplot with age on the x axis and wage on the y axis. Group the observations by marital status (that is, set the color of the points based on the `married` variable).
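One possible answer to this exercise is sketched below. It follows the same pattern as the earlier plots; the trend lines are an optional extra that the exercise does not require, and the object name `p_married` is illustrative.

```r
library(ggplot2)
library(dplyr)

data(CPS85, package = "mosaicData")

# Drop the wage outlier, as in the earlier steps
plotdata <- filter(CPS85, wage < 40)

# age on the x axis, wage on the y axis, color mapped to marital status
p_married <- ggplot(data = plotdata,
                    mapping = aes(x = age, y = wage, color = married)) +
  geom_point(alpha = .7, size = 3) +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE, linewidth = 1.5)

print(p_married)
```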
Scales control how variables are mapped to the visual characteristics of the plot. Scale functions (which start with `scale_`) allow you to modify this mapping. In the next plot, we’ll change the x and y axis scaling, and the colors employed.
    # modify the x and y axes and specify the colors to be used
    ggplot(data = plotdata,
           mapping = aes(x = exper, y = wage, color = sex)) +
      geom_point(alpha = .7, size = 3) +
      geom_smooth(method = "lm", se = FALSE, linewidth = 1.5) +
      scale_x_continuous(breaks = seq(0, 60, 10)) + # adjust the scales of the axes
      scale_y_continuous(breaks = seq(0, 30, 5),
                         labels = scales::dollar) +
      scale_color_manual(values = c("indianred3", "cornflowerblue")) # one color per level of sex, used for both points and trend lines
Facets reproduce a graph for each level of a given variable (or combination of variables). Facet functions start with `facet_`. Here, facets are defined by the eight levels of the sector variable.
    ggplot(data = plotdata,
           mapping = aes(x = exper, y = wage, color = sex)) +
      geom_point(alpha = .7) +
      geom_smooth(method = "lm", se = FALSE) +
      scale_x_continuous(breaks = seq(0, 60, 10)) +
      scale_y_continuous(breaks = seq(0, 30, 5),
                         labels = scales::dollar) +
      scale_color_manual(values = c("indianred3", "cornflowerblue")) +
      facet_wrap(~sector) # the facet_ function: one panel for each sector
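Faceting on a combination of two variables can be sketched with `facet_grid`, which lays panels out in rows and columns. The example assumes the built-in `mtcars` data as a stand-in; `am` and `cyl` are columns of that dataset, not of the data used in this tutorial.

```r
library(ggplot2)

# One panel per combination of two variables:
# rows are defined by am, columns by cyl
p_grid <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point(alpha = .7) +
  facet_grid(am ~ cyl, labeller = label_both) # label_both prints "variable: value" on each strip

print(p_grid)
```

`facet_wrap` suits a single variable with many levels, while `facet_grid` tends to be easier to read when two variables are crossed.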
The `labs` function provides customized labels for the axes and legends. Additionally, a custom title, subtitle, and caption can be added.
    # add informative labels
    ggplot(data = plotdata,
           mapping = aes(x = exper, y = wage, color = sex)) +
      geom_point(alpha = .7) +
      geom_smooth(method = "lm", se = FALSE) +
      scale_x_continuous(breaks = seq(0, 60, 10)) +
      scale_y_continuous(breaks = seq(0, 30, 5),
                         labels = scales::dollar) +
      scale_color_manual(values = c("indianred3", "cornflowerblue")) +
      facet_wrap(~sector) +
      labs(title = "Relationship between wages and experience",
           subtitle = "Current Population Survey",
           caption = "source: http://mosaic-web.org/",
           x = "Years of Experience",
           y = "Hourly Wage",
           color = "Gender")
The distribution of a single categorical variable is typically plotted with a bar chart, a pie chart, or (less commonly) a tree map.
There are two types of bar charts: `geom_bar()` and `geom_col()`. `geom_bar()` makes the height of the bar proportional to the number of cases in each group.
The Marriage dataset contains the marriage records of 98 individuals in Mobile County, Alabama. Below, a bar chart is used to display the distribution of wedding participants by race.
    library(ggplot2)
    data(Marriage, package = "mosaicData")

    ggplot(Marriage, aes(x = race)) +
      geom_bar(fill = "cornflowerblue", color = "black") +
      labs(x = "Race",
           y = "Frequency",
           title = "Participants by race")
If the bars’ height needs to represent values in the data, use `geom_col()` instead.
    ggplot(Marriage, aes(officialTitle, delay)) +
      geom_col(colour = "cornflowerblue", fill = "cornflowerblue") +
      labs(x = "Official Title",
           y = "Delay")
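The relationship between the two functions can be sketched as follows: `geom_bar()` counts the rows for you, while `geom_col()` plots heights that are already in the data. The example assumes the built-in `mtcars` data as a stand-in, and the object names are illustrative.

```r
library(ggplot2)

# geom_bar() counts the rows in each group itself
p_bar <- ggplot(mtcars, aes(x = factor(cyl))) +
  geom_bar(fill = "cornflowerblue", color = "black")

# geom_col() expects the bar heights to be in the data already,
# so we tabulate first and then plot the precomputed counts
counts <- as.data.frame(table(cyl = mtcars$cyl)) # columns: cyl, Freq
p_col <- ggplot(counts, aes(x = cyl, y = Freq)) +
  geom_col(fill = "cornflowerblue", color = "black")

print(p_bar)
print(p_col)
```

Both plots show the same bars; the difference is only in who does the counting.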
If your goal is to compare each category with the whole (e.g., what portion of participants are Hispanic compared to all participants), and the number of categories is small, then pie charts may work for you.
    # create a basic ggplot2 pie chart
    plotdata <- Marriage %>%
      count(race) %>%
      arrange(desc(race)) %>%
      mutate(prop = round(n * 100 / sum(n), 1))

    ggplot(plotdata, aes(x = "", y = prop, fill = race)) +
      geom_bar(width = 1, stat = "identity", color = "black") +
      coord_polar("y", start = 0, direction = -1) +
      theme_void()
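Since the percentages are already computed, they can also be printed on the slices. The sketch below shows one way to do this with `geom_text` and `position_stack`; it assumes the built-in `mtcars` data as a stand-in, uses `geom_col()` in place of the equivalent `geom_bar(stat = "identity")`, and the `label` column is a helper introduced only for this illustration.

```r
library(ggplot2)

# Tabulate a categorical variable and compute percentages
# (mtcars is an assumed stand-in dataset)
counts <- as.data.frame(table(cyl = mtcars$cyl)) # columns: cyl, Freq
counts$prop  <- round(counts$Freq * 100 / sum(counts$Freq), 1)
counts$label <- paste0(counts$prop, "%")          # hypothetical helper column

p_pie <- ggplot(counts, aes(x = "", y = prop, fill = cyl)) +
  geom_col(width = 1, color = "black") +
  # position_stack(vjust = 0.5) centers each label within its slice
  geom_text(aes(label = label), position = position_stack(vjust = 0.5)) +
  coord_polar("y", start = 0) +
  theme_void()

print(p_pie)
```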
A tree map is an alternative to a pie chart that can also handle categorical variables with many levels.
    # install.packages("treemapify")
    library(treemapify)
    data(Marriage, package = "mosaicData")

    # create a treemap of marriage officials
    plotdata <- Marriage %>%
      count(officialTitle)

    ggplot(plotdata, aes(fill = officialTitle, area = n)) +
      geom_treemap() +
      labs(title = "Marriages by officiate")
We can also label each area with its category:
    ggplot(plotdata, aes(fill = officialTitle, area = n, label = officialTitle)) +
      geom_treemap() +
      geom_treemap_text(colour = "white", place = "centre") +
      labs(title = "Marriages by officiate")
Correlation plots help you to visualize the pairwise relationships between a set of quantitative variables by displaying their correlations using color or shading. We first calculate the correlations.
    # load data
    data(SaratogaHouses, package = "mosaicData")

    # select numeric variables
    df <- dplyr::select_if(SaratogaHouses, is.numeric)

    # calculate the correlations
    r <- cor(df, use = "complete.obs")
    round(r, 2)
    ##            price lotSize   age landValue livingArea pctCollege bedrooms
    ## price       1.00    0.16 -0.19      0.58       0.71       0.20     0.40
    ## lotSize     0.16    1.00 -0.02      0.06       0.16      -0.03     0.11
    ## age        -0.19   -0.02  1.00     -0.02      -0.17      -0.04     0.03
    ## landValue   0.58    0.06 -0.02      1.00       0.42       0.23     0.20
    ## livingArea  0.71    0.16 -0.17      0.42       1.00       0.21     0.66
    ## pctCollege  0.20   -0.03 -0.04      0.23       0.21       1.00     0.16
    ## bedrooms    0.40    0.11  0.03      0.20       0.66       0.16     1.00
    ## fireplaces  0.38    0.09 -0.17      0.21       0.47       0.25     0.28
    ## bathrooms   0.60    0.08 -0.36      0.30       0.72       0.18     0.46
    ## rooms       0.53    0.14 -0.08      0.30       0.73       0.16     0.67
    ##            fireplaces bathrooms rooms
    ## price            0.38      0.60  0.53
    ## lotSize          0.09      0.08  0.14
    ## age             -0.17     -0.36 -0.08
    ## landValue        0.21      0.30  0.30
    ## livingArea       0.47      0.72  0.73
    ## pctCollege       0.25      0.18  0.16
    ## bedrooms         0.28      0.46  0.67
    ## fireplaces       1.00      0.44  0.32
    ## bathrooms        0.44      1.00  0.52
    ## rooms            0.32      0.52  1.00
    # install.packages("ggcorrplot")
    library(ggcorrplot)
    ggcorrplot(r) # plot the correlations
We can also customize the output:
- `hc.order = TRUE` reorders the variables, placing variables with similar correlation patterns together.
- `type = "lower"` plots the lower portion of the correlation matrix.
- `lab = TRUE` overlays the correlation coefficients (as text) on the plot.

    ggcorrplot(r, hc.order = TRUE, type = "lower", lab = TRUE)
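Plotting only the lower portion loses no information, because a correlation matrix is symmetric and has 1s on its diagonal. The short base R sketch below checks this on a small example; it assumes the built-in `mtcars` data as a stand-in, and `vars` and `r_demo` are illustrative names.

```r
# mtcars (built into R) is an assumed stand-in for the housing data
vars <- c("mpg", "wt", "hp", "disp")
r_demo <- cor(mtcars[, vars], use = "complete.obs")

# A correlation matrix is symmetric: r_demo[i, j] equals r_demo[j, i]
isSymmetric(r_demo)

# Every variable correlates perfectly with itself, so the diagonal is all 1s
all.equal(unname(diag(r_demo)), rep(1, length(vars)))

# The lower triangle therefore holds every distinct pairwise correlation
r_demo[lower.tri(r_demo)]
```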