The functions in the ggplot2 package build up a graph in layers. We’ll build a complex graph by starting with a simple graph and adding additional elements, one at a time.
The example explores the relationship between smoking, obesity, age, and medical costs using data from the Medical Insurance Costs dataset (Appendix A.4).
First, let’s import the data.
# load the data
url <- "https://tinyurl.com/mtktm8e5"
insurance <- read.csv(url)
View the data. What is the size of this data set?
head(insurance)
## age sex bmi children smoker region expenses
## 1 19 female 27.9 0 yes southwest 16884.92
## 2 18 male 33.8 1 no southeast 1725.55
## 3 28 male 33.0 3 no southeast 4449.46
## 4 33 male 22.7 0 no northwest 21984.47
## 5 32 male 28.9 0 no northwest 3866.86
## 6 31 female 25.7 0 no southeast 3756.62
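One way to answer the size question is with dim(), nrow(), and ncol(). The sketch below is only an illustration: `toy` is a stand-in built from the six rows printed by head() above, so it runs without downloading anything. On the full data you would call the same functions on insurance.

```r
# Stand-in for `insurance`: the six rows printed by head() above
toy <- data.frame(
  age      = c(19, 18, 28, 33, 32, 31),
  sex      = c("female", "male", "male", "male", "male", "female"),
  bmi      = c(27.9, 33.8, 33.0, 22.7, 28.9, 25.7),
  children = c(0, 1, 3, 0, 0, 0),
  smoker   = c("yes", "no", "no", "no", "no", "no"),
  region   = c("southwest", "southeast", "southeast",
               "northwest", "northwest", "southeast"),
  expenses = c(16884.92, 1725.55, 4449.46, 21984.47, 3866.86, 3756.62)
)

dim(toy)    # number of rows, then number of columns
nrow(toy)   # number of observations
ncol(toy)   # number of variables
str(toy)    # column types at a glance
```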
Use your skills from the previous lesson to find the following:
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
insurance %>%
group_by(sex) %>%
summarize( mean_BMI = mean(bmi))
## # A tibble: 2 × 2
## sex mean_BMI
## <chr> <dbl>
## 1 female 30.4
## 2 male 30.9
library(dplyr)
insurance %>%
group_by(region) %>%
summarize( mean_BMI = mean(bmi))
## # A tibble: 4 × 2
## region mean_BMI
## <chr> <dbl>
## 1 northeast 29.2
## 2 northwest 29.2
## 3 southeast 33.4
## 4 southwest 30.6
library(dplyr)
insurance %>%
group_by(sex) %>%
summarize( prop_smoker = mean(smoker == "yes"))
## # A tibble: 2 × 2
## sex prop_smoker
## <chr> <dbl>
## 1 female 0.174
## 2 male 0.235
library(dplyr)
insurance %>%
group_by(region) %>%
summarize( prop_smoker = mean(smoker == "yes"))
## # A tibble: 4 × 2
## region prop_smoker
## <chr> <dbl>
## 1 northeast 0.207
## 2 northwest 0.178
## 3 southeast 0.25
## 4 southwest 0.178
What else can you think of that might be interesting to find in this data? Find it!
Next, we’ll add a variable indicating if the patient is obese or not. Obesity will be defined as a body mass index greater than or equal to 30.
# create an obesity variable
insurance$obese <- ifelse(insurance$bmi >= 30,
"obese", "not obese")
Add a variable indicating if the patient has kids or not.
insurance$has_kids <- insurance$children > 0
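The two derived variables can also be created in a single dplyr step and then checked with table(). This is a sketch under the assumption that dplyr is installed. `toy` is a stand-in holding the bmi and children values from the six rows shown by head() earlier, and `obese_counts` is just an illustrative name.

```r
library(dplyr)

# Stand-in data: bmi and children from the first six rows of `insurance`
toy <- data.frame(
  bmi      = c(27.9, 33.8, 33.0, 22.7, 28.9, 25.7),
  children = c(0, 1, 3, 0, 0, 0)
)

# Create both indicator variables at once
toy <- toy %>%
  mutate(obese    = ifelse(bmi >= 30, "obese", "not obese"),
         has_kids = children > 0)

# Sanity check: how many rows fall in each category?
obese_counts <- table(toy$obese)
obese_counts
table(toy$has_kids)
```

Checking the counts right after creating a variable is a cheap way to catch a reversed comparison or a typo in a label.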
The first function in building a graph is the ggplot function. It specifies the data frame to be used and the mapping of the variables to the visual properties of the graph. The mappings are placed within the aes function, which stands for aesthetics. Let’s start by looking at the relationship between age and medical expenses.
# specify dataset and mapping
library(ggplot2)
ggplot(data = insurance,
mapping = aes(x = age, y = expenses))
Why is the graph empty? We specified that the age variable should be mapped to the x-axis and that the expenses should be mapped to the y-axis, but we haven’t yet specified what we wanted placed on the graph.
In ggplot2 graphs, functions are chained together using the + sign to build a final plot.
# add points
ggplot(data = insurance,
mapping = aes(x = age, y = expenses)) +
geom_point()
What do you notice?
The points appear to fall into three distinct bands, and expenses increase with age.
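The empty plot and the + chaining described above can be checked by counting the layers stored in a plot object. This is a rough sketch: `toy` is a stand-in built from the first six rows of the data, and inspecting `$layers` is an illustration only, since the internals of a ggplot object may differ between ggplot2 versions.

```r
library(ggplot2)

# Stand-in data: age and expenses from the first six rows of `insurance`
toy <- data.frame(
  age      = c(19, 18, 28, 33, 32, 31),
  expenses = c(16884.92, 1725.55, 4449.46, 21984.47, 3866.86, 3756.62)
)

# Data and mapping only: nothing has been drawn yet
base_plot <- ggplot(toy, aes(x = age, y = expenses))
n_before <- length(base_plot$layers)

# Adding a geom with + appends one layer
with_points <- base_plot + geom_point()
n_after <- length(with_points$layers)

c(before = n_before, after = n_after)
```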
A number of parameters (options) can be specified in a geom_ function. Options for the geom_point function include color, size, and alpha. These control the point color, size, and transparency, respectively.
# make points blue, larger, and semi-transparent
ggplot(data = insurance,
mapping = aes(x = age, y = expenses)) +
geom_point(color = "cornflowerblue",
alpha = .5,
size = 2.5)
Transparency ranges from 0 (completely transparent) to 1 (completely
opaque), and is specified using alpha.
# make the points more transparent
ggplot(data = insurance,
mapping = aes(x = age, y = expenses)) +
geom_point(color = "cornflowerblue",
alpha = .2,
size = 2.5)
Which of the previous graphs is more explanatory? Why?
The more transparent version is more informative. Overlapping points show up as darker regions, which makes the density of the data easier to see.
Next, let’s add a line of best fit. We can do this with the geom_smooth function. Options control the type of line (linear, quadratic, nonparametric), the thickness of the line, the line’s color, and the presence or absence of a confidence interval. Here we request a linear regression line (method = "lm", where lm stands for linear model).
# add a line of best fit.
ggplot(data = insurance,
mapping = aes(x = age, y = expenses)) +
geom_point(color = "cornflowerblue",
alpha = .5,
size = 2) +
geom_smooth(method = "lm")
## `geom_smooth()` using formula = 'y ~ x'
By default, the confidence interval for the curve is shown. We can remove it as follows.
# add a line of best fit without the confidence band
ggplot(data = insurance,
mapping = aes(x = age, y = expenses)) +
geom_point(color = "cornflowerblue",
alpha = .5,
size = 2) +
geom_smooth(method = "lm",
se = FALSE)
## `geom_smooth()` using formula = 'y ~ x'
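The message printed above says the line is fit with the formula y ~ x. The same straight line can be fit directly with lm(). The sketch below uses a stand-in data frame (`toy`, the first six rows of the data), so the coefficients are not those of the full data set.

```r
# Stand-in data: age and expenses from the first six rows of `insurance`
toy <- data.frame(
  age      = c(19, 18, 28, 33, 32, 31),
  expenses = c(16884.92, 1725.55, 4449.46, 21984.47, 3866.86, 3756.62)
)

# Fit expenses as a linear function of age (y ~ x in the plot)
fit <- lm(expenses ~ age, data = toy)

# Intercept and slope of the fitted line
coef(fit)
```

The slope is the estimated change in expenses for each additional year of age.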
In addition to mapping variables to the x and y axes, variables can be mapped to the color, shape, size, transparency, and other visual characteristics of geometric objects. This allows groups of observations to be superimposed in a single graph.
Let’s add sex to the plot and represent it by color.
# indicate sex using color
ggplot(data = insurance,
mapping = aes(x = age,
y = expenses,
color = sex)) +
geom_point(alpha = .5,
size = 2) +
geom_smooth(method = "lm",
se = FALSE,
linewidth = 1.5)
## `geom_smooth()` using formula = 'y ~ x'
The color = sex option is placed in the aes function because we are mapping a variable to an aesthetic (a visual characteristic of the graph). The geom_smooth option (se = FALSE) was added to suppress the confidence intervals.
What do you notice in the above graph?
Both sexes show the same trend, but men spend more.
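The difference between setting a constant color and mapping a variable to color can be seen by looking at where ggplot2 stores each one. This is a sketch for illustration: `toy` is a stand-in built from the first six rows of the data, and inspecting `$aes_params` and `$mapping` relies on object internals that may change between versions.

```r
library(ggplot2)

# Stand-in data from the first six rows of `insurance`
toy <- data.frame(
  age      = c(19, 18, 28, 33, 32, 31),
  sex      = c("female", "male", "male", "male", "male", "female"),
  expenses = c(16884.92, 1725.55, 4449.46, 21984.47, 3866.86, 3756.62)
)

# Constant color: set outside aes(), every point is blue, no legend
p_fixed <- ggplot(toy, aes(x = age, y = expenses)) +
  geom_point(color = "blue")

# Mapped color: set inside aes(), color depends on sex, legend is added
p_mapped <- ggplot(toy, aes(x = age, y = expenses, color = sex)) +
  geom_point()

p_fixed$layers[[1]]$aes_params   # the constant is stored with the layer
names(p_mapped$mapping)          # the mapping is stored with the plot
```

A common mistake is writing aes(color = "blue"), which treats "blue" as a data value and produces a legend with a single level.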
Instead of sex, let’s add smoker status and represent it by color.
# indicate smoker status using color
ggplot(data = insurance,
mapping = aes(x = age,
y = expenses,
color = smoker)) +
geom_point(alpha = .5,
size = 2) +
geom_smooth(method = "lm",
se = FALSE,
linewidth = 1.5) +
scale_color_manual(values = c("green",
"orange"))
## `geom_smooth()` using formula = 'y ~ x'
What else do you think we could improve on this graph?
Possibly using triangles rather than dots.
Scales control how variables are mapped to the visual characteristics
of the plot. Scale functions (which start with scale_)
allow you to modify this mapping. In the next plot, we’ll change the x
and y axis scaling, and the colors employed.
# modify the x and y axes and specify the colors to be used
ggplot(data = insurance,
mapping = aes(x = age,
y = expenses,
color = smoker)) +
geom_point(alpha = .5,
size = 2) +
geom_smooth(method = "lm",
se = FALSE,
linewidth = 1.5) +
scale_x_continuous(breaks = seq(0, 70, 10)) +
scale_y_continuous(breaks = seq(0, 60000, 20000),
labels = scales::dollar) +
scale_color_manual(values = c("indianred3",
"cornflowerblue"))
## `geom_smooth()` using formula = 'y ~ x'
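To see what the scale functions above are given, the breaks and labels can be computed on their own. This small sketch assumes the scales package is installed (it is installed alongside ggplot2).

```r
library(scales)

# Tick positions passed to scale_y_continuous(breaks = ...)
y_breaks <- seq(0, 60000, 20000)
y_breaks

# The labelling function turns each break into a dollar string
y_labels <- dollar(y_breaks)
y_labels

# Tick positions passed to scale_x_continuous(breaks = ...)
x_breaks <- seq(0, 70, 10)
x_breaks
```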
We’re getting there. Here is a question. Is the relationship between age, expenses, and smoking the same for obese and non-obese patients? Let’s repeat this graph once for each weight status in order to explore this.
Facets reproduce a graph for each level of a given variable (or pair
of variables). Facets are created using functions that start with
facet_. Here, facets will be defined by the two levels of
the obese variable.
# reproduce plot for obese and non-obese individuals
ggplot(data = insurance,
mapping = aes(x = age,
y = expenses,
color = smoker)) +
geom_point(alpha = .5) +
geom_smooth(method = "lm",
se = FALSE) +
scale_x_continuous(breaks = seq(0, 70, 10)) +
scale_y_continuous(breaks = seq(0, 60000, 20000),
labels = scales::dollar) +
scale_color_manual(values = c("indianred3",
"cornflowerblue")) +
facet_wrap(~obese)
## `geom_smooth()` using formula = 'y ~ x'
What do you observe?
Expenses are highest for smokers who are obese.
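Facets can also be defined by a pair of variables, as mentioned above. A minimal sketch using facet_grid, with rows defined by has_kids and columns by obese, might look like this. `toy` is a stand-in built from the first six rows of the data, and the choice of has_kids as the second variable is only an example.

```r
library(ggplot2)

# Stand-in data from the first six rows of `insurance`
toy <- data.frame(
  age      = c(19, 18, 28, 33, 32, 31),
  bmi      = c(27.9, 33.8, 33.0, 22.7, 28.9, 25.7),
  children = c(0, 1, 3, 0, 0, 0),
  smoker   = c("yes", "no", "no", "no", "no", "no"),
  expenses = c(16884.92, 1725.55, 4449.46, 21984.47, 3866.86, 3756.62)
)
toy$obese    <- ifelse(toy$bmi >= 30, "obese", "not obese")
toy$has_kids <- toy$children > 0

# One panel per combination: rows = has_kids, columns = obese
p_grid <- ggplot(toy, aes(x = age, y = expenses, color = smoker)) +
  geom_point(alpha = .5) +
  facet_grid(has_kids ~ obese)

p_grid
```

facet_wrap lays out the panels for one variable in a ribbon, while facet_grid arranges two variables as rows and columns.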
Graphs should be easy to interpret and informative labels are a key element in achieving this goal. The labs function provides customized labels for the axes and legends. Additionally, a custom title, subtitle, and caption can be added.
# add informative labels
ggplot(data = insurance,
mapping = aes(x = age,
y = expenses,
color = smoker)) +
geom_point(alpha = .5) +
geom_smooth(method = "lm",
se = FALSE) +
scale_x_continuous(breaks = seq(0, 70, 10)) +
scale_y_continuous(breaks = seq(0, 60000, 20000),
labels = scales::dollar) +
scale_color_manual(values = c("indianred3",
"cornflowerblue")) +
facet_wrap(~obese) +
labs(title = "Relationship between patient demographics and medical costs",
subtitle = "US Census Bureau 2013",
caption = "source: http://mosaic-web.org/",
x = "Age (years)",
y = "Annual expenses",
color = "Smoker?")
## `geom_smooth()` using formula = 'y ~ x'
Finally, we can fine tune the appearance of the graph using themes. Theme functions (which start with theme_) control background colors, fonts, grid-lines, legend placement, and other non-data related features of the graph. Let’s use a cleaner theme.
# use a minimalist theme
ggplot(data = insurance,
mapping = aes(x = age,
y = expenses,
color = smoker)) +
geom_point(alpha = .5) +
geom_smooth(method = "lm",
se = FALSE) +
scale_x_continuous(breaks = seq(0, 70, 10)) +
scale_y_continuous(breaks = seq(0, 60000, 20000),
labels = scales::dollar) +
scale_color_manual(values = c("indianred3",
"cornflowerblue")) +
facet_wrap(~obese) +
labs(title = "Relationship between age and medical expenses",
subtitle = "US Census Data 2013",
caption = "source: https://github.com/dataspelunking/MLwR",
x = "Age (years)",
y = "Medical Expenses",
color = "Smoker?") +
theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
From this graph, we can make a few observations:
Plots created with ggplot2 always start with the ggplot function. In the examples above, the data and mapping options were placed in this function. In this case they apply to each geom_ function that follows. You can also place these options directly within a geom. In that case, they only apply to that specific geom.
# placing color mapping in the ggplot function
ggplot(insurance,
aes(x = age,
y = expenses,
color = smoker)) +
geom_point(alpha = .5,
size = 2) +
geom_smooth(method = "lm",
se = FALSE,
linewidth = 1.5)
## `geom_smooth()` using formula = 'y ~ x'
# placing color mapping in the geom_point function
ggplot(insurance,
aes(x = age,
y = expenses)) +
geom_point(aes(color = smoker),
alpha = .5,
size = 2) +
geom_smooth(method = "lm",
se = FALSE,
linewidth = 1.5)
## `geom_smooth()` using formula = 'y ~ x'
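What is the difference between the two graphs?
When the color mapping is placed in the ggplot function, it applies to every geom, so geom_smooth fits a separate line for smokers and non-smokers. When it is placed only in geom_point, geom_smooth does not see it and fits a single line to all of the data.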
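The number of fitted lines can be confirmed by counting the groups in the data that ggplot2 computes for the smoothing layer. The sketch below uses layer_data() on a stand-in data frame (`toy`, the first six rows of the data) and colors by sex rather than smoker, because the stand-in contains only one smoker.

```r
library(ggplot2)

# Stand-in data from the first six rows of `insurance`
toy <- data.frame(
  age      = c(19, 18, 28, 33, 32, 31),
  sex      = c("female", "male", "male", "male", "male", "female"),
  expenses = c(16884.92, 1725.55, 4449.46, 21984.47, 3866.86, 3756.62)
)

# Color mapped in ggplot(): inherited by both geoms
p_global <- ggplot(toy, aes(x = age, y = expenses, color = sex)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE)

# Color mapped in geom_point() only: geom_smooth does not see it
p_local <- ggplot(toy, aes(x = age, y = expenses)) +
  geom_point(aes(color = sex)) +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE)

# Layer 2 is geom_smooth; each group gets its own fitted line
n_global <- length(unique(layer_data(p_global, 2)$group))
n_local  <- length(unique(layer_data(p_local, 2)$group))
c(global = n_global, local = n_local)
```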
A few notes:
The options data= and mapping= can be omitted, since the first option always refers to data and the second option always refers to mapping.
A ggplot2 graph can be saved as a named R object (like a data frame), manipulated further, and then printed or saved to disk.
# create a scatterplot and save it
myplot <- ggplot(data = insurance,
aes(x = age, y = expenses)) +
geom_point()
# plot the graph
myplot
# make the points larger and blue
# then print the graph
myplot <- myplot + geom_point(size = 2, color = "blue")
myplot
# print the graph with a title and line of best fit
# but don't save those changes
myplot + geom_smooth(method = "lm") +
labs(title = "Mildly interesting graph")
## `geom_smooth()` using formula = 'y ~ x'
# print the graph with a black-and-white theme
# but don't save those changes
myplot + theme_bw()
This can be a real time saver. It is also handy when saving graphs programmatically.
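As an example of saving a graph programmatically, a stored plot can be written to disk with ggsave(). This is a sketch only: the stand-in data, the temporary file location, and the 6 x 4 inch size are arbitrary choices for illustration.

```r
library(ggplot2)

# Stand-in data from the first six rows of `insurance`
toy <- data.frame(
  age      = c(19, 18, 28, 33, 32, 31),
  expenses = c(16884.92, 1725.55, 4449.46, 21984.47, 3866.86, 3756.62)
)

# Build the plot and keep it as an object
myplot <- ggplot(toy, aes(x = age, y = expenses)) +
  geom_point()

# Write it to a file; the extension determines the output format
out_file <- file.path(tempdir(), "myplot.pdf")
ggsave(out_file, plot = myplot, width = 6, height = 4)

file.exists(out_file)
```

Passing the plot explicitly avoids relying on the default of saving whichever plot was displayed last.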
Load the loan50.csv dataset. Please refer to the following page for a description of the variables in this data set.
https://www.openintro.org/data/index.php?data=loan50
loans = read.csv("loan50.csv")
head(loans)
## state emp_length term homeownership annual_income verified_income
## 1 NJ 3 60 rent 59000 Not Verified
## 2 CA 10 36 rent 60000 Not Verified
## 3 SC NA 36 mortgage 75000 Verified
## 4 CA 0 36 rent 75000 Not Verified
## 5 OH 4 60 mortgage 254000 Not Verified
## 6 IN 6 36 mortgage 67000 Source Verified
## debt_to_income total_credit_limit total_credit_utilized
## 1 0.5575254 95131 32894
## 2 1.3056833 51929 78341
## 3 1.0562800 301373 79221
## 4 0.5743467 59890 43076
## 5 0.2381496 422619 60490
## 6 1.0770448 349825 72162
## num_cc_carrying_balance loan_purpose loan_amount grade interest_rate
## 1 8 debt_consolidation 22000 B 10.90
## 2 2 credit_card 6000 B 9.92
## 3 14 debt_consolidation 25000 E 26.30
## 4 10 credit_card 6000 B 9.92
## 5 2 home_improvement 25000 B 9.43
## 6 4 home_improvement 6400 B 9.92
## public_record_bankrupt loan_status has_second_income total_income
## 1 0 Current FALSE 59000
## 2 1 Current FALSE 60000
## 3 0 Current FALSE 75000
## 4 0 Current FALSE 75000
## 5 0 Current FALSE 254000
## 6 0 Current FALSE 67000
Term and public record of bankrupt.
loans <- mutate(loans,
term = as.factor(term),
public_record_bankrupt = as.factor(public_record_bankrupt))
loans <- mutate(loans,
across( c(term, public_record_bankrupt), as.factor))
These variables are now factors (categorical) rather than numerical.
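The effect of the conversion can be checked on a small vector. The sketch below uses the six term values printed by head(loans) above. The names `term` and `term_f` are just for illustration.

```r
# The term values from the first six rows of `loans`
term <- c(60, 36, 36, 36, 60, 36)

# Convert the numbers to a factor with one level per distinct value
term_f <- as.factor(term)

class(term_f)        # "factor"
levels(term_f)       # the distinct categories, stored as text
is.numeric(term_f)   # FALSE: it is no longer treated as a number
table(term_f)        # counts per category
```

Because term is now a factor, ggplot2 gives each level its own discrete color instead of a continuous color gradient.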
For each, determine which factors can be used to better understand this relationship (e.g. loan grade, term, homeownership, whether income was verified, etc.).
p = ggplot( data = loans,
mapping = aes( x = annual_income,
y = total_credit_limit,
color = term)) +
geom_point()
library(scales)
p + scale_x_continuous(labels = unit_format(unit = "k", scale = 1e-3)) +
scale_y_continuous(labels = unit_format(unit = "k", scale = 1e-3))
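A similar labelling function can be built with label_number() from the scales package and tried on a few values before it is used in a plot. This is offered as an alternative sketch, not as what the code above does. `k_labels` is a hypothetical name, and the two incomes are taken from the head(loans) output.

```r
library(scales)

# Build a function that divides by 1000 and appends "k"
k_labels <- label_number(scale = 1e-3, suffix = "k")

# Try it on two annual incomes from the first rows of `loans`
k_out <- k_labels(c(59000, 254000))
k_out
```

The resulting function can be passed to the labels argument of scale_x_continuous or scale_y_continuous.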
library(scales)
p + scale_x_continuous(labels = scales::dollar) +
scale_y_continuous(labels = scales::dollar) +
labs( title = "Total Credit Limit vs Annual Income",
x = "Annual Income ($)",
y = "Total Credit Limit ($)")
p = ggplot( data = loans,
mapping = aes( x = total_credit_limit,
y = interest_rate,
color = public_record_bankrupt)) +
geom_point()
library(scales)
p + scale_x_continuous(labels = scales::dollar) +
scale_y_continuous(labels = unit_format(unit = "%")) +
labs( title = "Interest Rate vs Total Credit Limit",
x = "Total Credit Limit ($)",
y = "Interest Rate (%)")
p = ggplot( data = loans,
mapping = aes( x = annual_income,
y = interest_rate,
color = homeownership)) +
geom_point()
library(scales)
p + scale_x_continuous(labels = scales::dollar) +
scale_y_continuous(labels = unit_format(unit = "%")) +
labs( title = "Interest Rate vs Annual Income",
x = "Annual Income ($)",
y = "Interest Rate (%)")
In the first graph, I noticed that people with 36- and 60-month terms earn about the same amount of money. Their total credit limits are also about the same, apart from a few outliers.
In the second graph, I noticed that many more people declared bankruptcy in the annual income range of $50,000-$140,000. There were a few outliers, but most people who went bankrupt fell in the same cluster of annual income.
In the third graph, I noticed that people with mortgages ranged from low to high interest rates. Renters also tended to have higher interest rates, although some had low rates. Owners fall in the middle of the scatter plot. The most common interest rate appears to be around 11%.