3.1 A worked example

The functions in the ggplot2 package build up a graph in layers. We’ll build a complex graph by starting with a simple graph and adding additional elements, one at a time.

The example explores the relationship between smoking, obesity, age, and medical costs using data from the Medical Insurance Costs dataset (Appendix A.4).

Importing Data

First, let’s import the data.

# load the data
url <- "https://tinyurl.com/mtktm8e5"
insurance <- read.csv(url)

View the data. What is the size of this data set?

head(insurance)

##   age    sex  bmi children smoker    region expenses
## 1  19 female 27.9        0    yes southwest 16884.92
## 2  18   male 33.8        1     no southeast  1725.55
## 3  28   male 33.0        3     no southeast  4449.46
## 4  33   male 22.7        0     no northwest 21984.47
## 5  32   male 28.9        0     no northwest  3866.86
## 6  31 female 25.7        0     no southeast  3756.62
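
The size question can be answered with base R's dim, nrow, and ncol functions. The sketch below uses a small made-up data frame (toy is a hypothetical stand-in, not the insurance data); the same calls should work on insurance.

```r
# hypothetical stand-in for the insurance data (values are made up)
toy <- data.frame(age = c(19, 18, 28),
                  smoker = c("yes", "no", "no"),
                  expenses = c(100, 200, 300))

dim(toy)    # number of rows, then number of columns
nrow(toy)   # number of rows (observations)
ncol(toy)   # number of columns (variables)
str(toy)    # compact summary of each column's type
```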

Use your skills from the previous lesson to find the following:

Average BMI by sex

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

insurance %>%
  group_by(sex) %>%
  summarize( mean_BMI = mean(bmi))

## # A tibble: 2 × 2
##   sex    mean_BMI
##   <chr>     <dbl>
## 1 female     30.4
## 2 male       30.9

Average BMI by region

library(dplyr)
insurance %>%
  group_by(region) %>%
  summarize( mean_BMI = mean(bmi))

## # A tibble: 4 × 2
##   region    mean_BMI
##   <chr>        <dbl>
## 1 northeast     29.2
## 2 northwest     29.2
## 3 southeast     33.4
## 4 southwest     30.6

Proportion of smokers by sex

library(dplyr)
insurance %>%
  group_by(sex) %>%
  summarize( mean_BMI = sum(smoker == "yes")/length(smoker))

## # A tibble: 2 × 2
##   sex    mean_BMI
##   <chr>     <dbl>
## 1 female    0.174
## 2 male      0.235

Proportion of smokers by region

library(dplyr)
insurance %>%
  group_by(region) %>%
  summarize( prop_smokers = sum(smoker == "yes")/length(smoker))

## # A tibble: 4 × 2
##   region    prop_smokers
##   <chr>            <dbl>
## 1 northeast        0.207
## 2 northwest        0.178
## 3 southeast        0.25 
## 4 southwest        0.178
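
A proportion can also be computed as the mean of a logical vector, since TRUE counts as 1 and FALSE as 0. The sketch below shows this on a made-up data frame (toy and the column name prop_smokers are hypothetical choices for illustration).

```r
library(dplyr)

# hypothetical toy data: 1 of 4 females and 1 of 2 males smoke
toy <- data.frame(sex = c("female", "female", "female", "female",
                          "male", "male"),
                  smoker = c("yes", "no", "no", "no", "yes", "no"))

# the mean of a logical vector is the proportion of TRUE values
prop <- toy %>%
  group_by(sex) %>%
  summarize(prop_smokers = mean(smoker == "yes"))
prop
```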

What else can you think of that might be interesting to find in this data? Find it!

Adding a variable

Next, we’ll add a variable indicating if the patient is obese or not. Obesity will be defined as a body mass index greater than or equal to 30.

# create an obesity variable
insurance$obese <- ifelse(insurance$bmi >= 30, 
                          "obese", "not obese")

Add a variable indicating if the patient has kids or not.

# the comparison already returns TRUE/FALSE, so ifelse is not needed
insurance$has_kids <- insurance$children > 0
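
It can help to check a newly created variable by counting its categories. The sketch below applies the same BMI rule to a few made-up values (toy is hypothetical, not the insurance data) and tabulates the result; note that a BMI of exactly 30 counts as obese.

```r
# hypothetical BMI values, not taken from the insurance data
toy <- data.frame(bmi = c(22.7, 30.0, 33.8, 27.9))

# same rule as above: BMI >= 30 counts as obese
toy$obese <- ifelse(toy$bmi >= 30, "obese", "not obese")

# count how many rows fall into each category
table(toy$obese)
```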

ggplot

The first function in building a graph is the ggplot function. It specifies the data frame to be used and the mapping of the variables to the visual properties of the graph. The mappings are placed within the aes function, which stands for aesthetics. Let’s start by looking at the relationship between age and medical expenses.

# specify dataset and mapping
library(ggplot2)
ggplot(data = insurance,
       mapping = aes(x = age, y = expenses))

Why is the graph empty? We specified that the age variable should be mapped to the x-axis and that the expenses should be mapped to the y-axis, but we haven’t yet specified what we wanted placed on the graph.

In ggplot2 graphs, functions are chained together using the + sign to build a final plot.

# add points
ggplot(data = insurance,
       mapping = aes(x = age, y = expenses)) +
  geom_point()
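
One way to see why the first graph was empty is to count the layers stored in each plot object. This is a minimal sketch on made-up data (toy, p_empty, and p_points are hypothetical names); it assumes the plot's layers can be read with $layers.

```r
library(ggplot2)

# hypothetical toy data standing in for the insurance data
toy <- data.frame(age = c(20, 40, 60), expenses = c(2000, 6000, 12000))

# data and mapping only: axes are set up, but there is nothing to draw
p_empty <- ggplot(toy, aes(x = age, y = expenses))
length(p_empty$layers)

# adding a geom with + appends a layer to the plot
p_points <- p_empty + geom_point()
length(p_points$layers)
```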

What do you notice?

The data points appear to fall into three distinct bands, and expenses increase with age.

geom_point

A number of parameters (options) can be specified in a geom_ function. Options for the geom_point function include color, size, and alpha. These control the point color, size, and transparency, respectively.

# make points blue, larger, and semi-transparent
ggplot(data = insurance,
       mapping = aes(x = age, y = expenses)) +
  geom_point(color = "cornflowerblue",
             alpha = .5,
             size = 2.5)

Transparency ranges from 0 (completely transparent) to 1 (completely opaque), and is specified using alpha.

# make points blue, larger, and even more transparent
ggplot(data = insurance,
       mapping = aes(x = age, y = expenses)) +
  geom_point(color = "cornflowerblue",
             alpha = .2,
             size = 2.5)

Which of the previous graphs is more explanatory? Why?

The graph with more transparency (alpha = .2) is more informative. The extra transparency shows where points overlap, which makes the density of the data easier to see.

geom_smooth

Next, let’s add a line of best fit. We can do this with the geom_smooth function. Options control the type of line (linear, quadratic, nonparametric), the thickness of the line, the line’s color, and the presence or absence of a confidence interval. Here we request a linear regression (method = lm) line (where lm stands for linear model).

# add a line of best fit.
ggplot(data = insurance,
       mapping = aes(x = age, y = expenses)) +
  geom_point(color = "cornflowerblue",
             alpha = .5,
             size = 2) +
  geom_smooth(method = "lm")

## `geom_smooth()` using formula = 'y ~ x'

By default, the confidence interval for the curve is shown. We can remove it as follows.

# add a line of best fit.
ggplot(data = insurance,
       mapping = aes(x = age, y = expenses)) +
  geom_point(color = "cornflowerblue",
             alpha = .5,
             size = 2) +
  geom_smooth(method = "lm",
              se = FALSE)

## `geom_smooth()` using formula = 'y ~ x'
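
The line drawn with method = "lm" is a straight-line fit of y on x, so its intercept and slope can be inspected by fitting the model directly with lm. The sketch below uses made-up data lying exactly on a known line (toy and fit are hypothetical names), so the recovered coefficients are easy to check.

```r
# hypothetical data lying exactly on the line expenses = 1000 + 250 * age
toy <- data.frame(age = c(20, 30, 40, 50, 60))
toy$expenses <- 1000 + 250 * toy$age

# fit the same kind of straight-line model that method = "lm" requests
fit <- lm(expenses ~ age, data = toy)
coef(fit)   # intercept and slope of the fitted line
```

With the real data, lm(expenses ~ age, data = insurance) would give the coefficients of the line shown in the graph.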

Grouping

In addition to mapping variables to the x and y axes, variables can be mapped to the color, shape, size, transparency, and other visual characteristics of geometric objects. This allows groups of observations to be superimposed in a single graph.

Let’s add sex to the plot and represent it by color.

# indicate sex using color
ggplot(data = insurance,
       mapping = aes(x = age, 
                     y = expenses,
                     color = sex)) +
  geom_point(alpha = .5,
             size = 2) +
  geom_smooth(method = "lm", 
              se = FALSE, 
              linewidth = 1.5)

## `geom_smooth()` using formula = 'y ~ x'

The color = sex option is placed in the aes function because we are mapping a variable to an aesthetic (a visual characteristic of the graph). The geom_smooth option (se = FALSE) was added to suppress the confidence intervals.
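
The difference between mapping a variable to color and setting a fixed color can be sketched as follows. The data frame toy and the plot names are hypothetical; layer_data is used only to inspect the colors that ggplot2 ends up assigning to the points.

```r
library(ggplot2)

# hypothetical toy data
toy <- data.frame(age = c(20, 40, 60, 30),
                  expenses = c(2000, 6000, 12000, 3000),
                  sex = c("female", "male", "female", "male"))

# mapping: color depends on a variable, so it goes inside aes()
p_mapped <- ggplot(toy, aes(x = age, y = expenses, color = sex)) +
  geom_point()

# setting: every point gets the same fixed color, so it goes outside aes()
p_fixed <- ggplot(toy, aes(x = age, y = expenses)) +
  geom_point(color = "cornflowerblue")

# inspect the colors actually assigned to the points in each plot
unique(layer_data(p_mapped)$colour)
unique(layer_data(p_fixed)$colour)
```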

What do you notice in the above graph?

Both sexes show the same trend, but men spend more.

Instead of sex, let’s add smoker status and represent it by color.

# indicate smoker status using color
ggplot(data = insurance,
       mapping = aes(x = age, 
                     y = expenses,
                     color = smoker)) +
  geom_point(alpha = .5,
             size = 2) +
  geom_smooth(method = "lm", 
              se = FALSE, 
              linewidth = 1.5) +
  scale_color_manual(values = c("green",
                                "orange"))

## `geom_smooth()` using formula = 'y ~ x'

What else do you think we could improve on this graph?

Possibly using triangles rather than dots.

Scales

Scales control how variables are mapped to the visual characteristics of the plot. Scale functions (which start with scale_) allow you to modify this mapping. In the next plot, we’ll change the x and y axis scaling, and the colors employed.

# modify the x and y axes and specify the colors to be used
ggplot(data = insurance,
       mapping = aes(x = age, 
                     y = expenses,
                     color = smoker)) +
  geom_point(alpha = .5,
             size = 2) +
  geom_smooth(method = "lm", 
              se = FALSE, 
              linewidth = 1.5) +
  scale_x_continuous(breaks = seq(0, 70, 10)) +
  scale_y_continuous(breaks = seq(0, 60000, 20000),
                     labels = scales::dollar) +
  scale_color_manual(values = c("indianred3", 
                                "cornflowerblue"))

## `geom_smooth()` using formula = 'y ~ x'
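
To see what the y-axis scale is doing, the breaks and their labels can be computed on their own. This small sketch only reuses the seq call and the scales::dollar function from the plot above; breaks is a name chosen here for illustration.

```r
# the y-axis breaks requested above
breaks <- seq(0, 60000, 20000)
breaks

# scales::dollar turns numbers into dollar-formatted strings
scales::dollar(breaks)
```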

We’re getting there. Here is a question. Is the relationship between age, expenses, and smoking the same for obese and non-obese patients? Let’s repeat this graph once for each weight status in order to explore this.

Facets

Facets reproduce a graph for each level of a given variable (or pair of variables). Facets are created using functions that start with facet_. Here, facets will be defined by the two levels of the obese variable.

# reproduce the plot for obese and non-obese individuals
ggplot(data = insurance,
       mapping = aes(x = age, 
                     y = expenses,
                     color = smoker)) +
  geom_point(alpha = .5) +
  geom_smooth(method = "lm", 
              se = FALSE) +
  scale_x_continuous(breaks = seq(0, 70, 10)) +
  scale_y_continuous(breaks = seq(0, 60000, 20000),
                     labels = scales::dollar) +
  scale_color_manual(values = c("indianred3", 
                                "cornflowerblue")) +
  facet_wrap(~obese)

## `geom_smooth()` using formula = 'y ~ x'
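
Facets can also be defined by a pair of variables, for example with facet_grid. The sketch below is an illustration on made-up data (toy and p_grid are hypothetical names); it lays out one panel for each combination of obese and has_kids.

```r
library(ggplot2)

# hypothetical toy data with two grouping variables
toy <- data.frame(age = c(20, 40, 60, 30, 50, 25, 45, 55),
                  expenses = c(2, 6, 12, 3, 9, 4, 20, 30) * 1000,
                  obese = rep(c("not obese", "obese"), each = 4),
                  has_kids = rep(c(TRUE, FALSE), times = 4))

# rows are defined by obese, columns by has_kids
p_grid <- ggplot(toy, aes(x = age, y = expenses)) +
  geom_point() +
  facet_grid(obese ~ has_kids)
p_grid
```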

What do you observe?

Obese smokers have the highest medical expenses.

Labels

Graphs should be easy to interpret and informative labels are a key element in achieving this goal. The labs function provides customized labels for the axes and legends. Additionally, a custom title, subtitle, and caption can be added.

# add informative labels
ggplot(data = insurance,
       mapping = aes(x = age, 
                     y = expenses,
                     color = smoker)) +
  geom_point(alpha = .5) +
  geom_smooth(method = "lm", 
              se = FALSE) +
  scale_x_continuous(breaks = seq(0, 70, 10)) +
  scale_y_continuous(breaks = seq(0, 60000, 20000),
                     labels = scales::dollar) +
  scale_color_manual(values = c("indianred3", 
                                "cornflowerblue")) +
  facet_wrap(~obese) +
  labs(title = "Relationship between patient demographics and medical costs",
       subtitle = "US Census Bureau 2013",
       caption = "source: http://mosaic-web.org/",
       x = "Age (years)",
       y = "Annual expenses",
       color = "Smoker?")

## `geom_smooth()` using formula = 'y ~ x'

Themes

Finally, we can fine tune the appearance of the graph using themes. Theme functions (which start with theme_) control background colors, fonts, grid-lines, legend placement, and other non-data related features of the graph. Let’s use a cleaner theme.

# use a minimalist theme
ggplot(data = insurance,
       mapping = aes(x = age, 
                     y = expenses,
                     color = smoker)) +
  geom_point(alpha = .5) +
  geom_smooth(method = "lm", 
              se = FALSE) +
  scale_x_continuous(breaks = seq(0, 70, 10)) +
  scale_y_continuous(breaks = seq(0, 60000, 20000),
                     labels = scales::dollar) +
  scale_color_manual(values = c("indianred3", 
                                "cornflowerblue")) +
  facet_wrap(~obese) +
  labs(title = "Relationship between age and medical expenses",
       subtitle = "US Census Data 2013",
       caption = "source: https://github.com/dataspelunking/MLwR",
       x = "Age (years)",
       y = "Medical Expenses",
       color = "Smoker?") +
  theme_minimal()

## `geom_smooth()` using formula = 'y ~ x'

From this graph, we can make a few observations:

There is a positive linear relationship between age and expenses. The relationship is constant across smoking and obesity status (i.e., the slope doesn’t change).
Smokers and obese patients have higher medical expenses.
There is an interaction between smoking and obesity. Non-smokers look fairly similar across obesity groups. However, for smokers, obese patients have much higher expenses.
There are some very high outliers (large expenses) among the obese smoker group.
These findings are tentative. They are based on limited sample size and do not involve statistical testing to assess whether differences may be due to chance variation.
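
A more formal check of the smoking-by-obesity interaction could use a linear model with an interaction term. The sketch below is only an illustration on simulated, noise-free data (not the insurance data) in which an interaction of 15000 is built in, to show how the term appears among the coefficients. With the real data, summary(fit) would also report standard errors and p-values.

```r
# simulated, noise-free data (hypothetical; not the insurance data)
toy <- expand.grid(age = c(20, 30, 40, 50, 60),
                   smoker = c("no", "yes"),
                   obese = c("not obese", "obese"))
is_smoker <- toy$smoker == "yes"
is_obese  <- toy$obese == "obese"
toy$expenses <- 1000 + 250 * toy$age + 10000 * is_smoker +
  500 * is_obese + 15000 * (is_smoker & is_obese)

# smoker * obese expands to both main effects plus their interaction
fit <- lm(expenses ~ age + smoker * obese, data = toy)
coef(fit)
```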

3.2 Placing the data and mapping options

Plots created with ggplot2 always start with the ggplot function. In the examples above, the data and mapping options were placed in this function. In this case they apply to each geom_ function that follows. You can also place these options directly within a geom. In that case, they only apply to that specific geom.

# placing color mapping in the ggplot function
ggplot(insurance,
       aes(x = age, 
           y = expenses,
           color = smoker)) +
  geom_point(alpha = .5,
             size = 2) +
  geom_smooth(method = "lm",
              se = FALSE, 
              linewidth = 1.5)

## `geom_smooth()` using formula = 'y ~ x'

# placing color mapping in the geom_point function
ggplot(insurance,
       aes(x = age, 
           y = expenses)) +
  geom_point(aes(color = smoker),
             alpha = .5,
             size = 2) +
  geom_smooth(method = "lm",
              se = FALSE, 
              linewidth = 1.5)

## `geom_smooth()` using formula = 'y ~ x'

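What is the difference between the two graphs?

The first graph, with the color mapping in the ggplot function, has two lines of best fit (one for each smoking status) because geom_smooth inherits the mapping. The second graph, with the color mapping only in geom_point, has a single line of best fit for all of the points.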
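
The effect of where the color mapping is placed can be checked by counting the groups that geom_smooth fits a line for. This is a sketch on made-up data (toy, p_global, and p_local are hypothetical names); layer_data(plot, 2) returns the computed data for the second layer, which here is geom_smooth.

```r
library(ggplot2)

# hypothetical toy data: five non-smokers and five smokers
toy <- data.frame(age = rep(c(20, 30, 40, 50, 60), times = 2),
                  expenses = c(2, 3, 5, 6, 8, 20, 22, 25, 27, 30) * 1000,
                  smoker = rep(c("no", "yes"), each = 5))

# color mapped in ggplot(): inherited by both geoms
p_global <- ggplot(toy, aes(x = age, y = expenses, color = smoker)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)

# color mapped in geom_point() only: geom_smooth sees no grouping
p_local <- ggplot(toy, aes(x = age, y = expenses)) +
  geom_point(aes(color = smoker)) +
  geom_smooth(method = "lm", se = FALSE)

# layer 2 is geom_smooth; count the groups it fitted a line for
length(unique(layer_data(p_global, 2)$group))
length(unique(layer_data(p_local, 2)$group))
```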

A few notes:

Most of the examples in this class will place the data and mapping options in the ggplot function.
The phrases data= and mapping= can be omitted since the first option always refers to data and the second option always refers to mapping.

3.3 Graphs as Objects

A ggplot2 graph can be saved as a named R object (like a data frame), manipulated further, and then printed or saved to disk.

# create a scatterplot and save it
myplot <- ggplot(data = insurance,
                  aes(x = age, y = expenses)) +
             geom_point()

# plot the graph
myplot

# make the points larger and blue
# then print the graph
myplot <- myplot + geom_point(size = 2, color = "blue")
myplot

# print the graph with a title and line of best fit
# but don't save those changes
myplot + geom_smooth(method = "lm") +
  labs(title = "Mildly interesting graph")

## `geom_smooth()` using formula = 'y ~ x'

# print the graph with a black-and-white theme
# but don't save those changes
myplot + theme_bw()

This can be a real time saver. It is also handy when saving graphs programmatically.
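
Saving a graph programmatically can be sketched with the ggsave function. The example below builds a small plot from made-up data and writes it to a temporary PDF; toy, myplot_toy, and the file path are hypothetical choices for illustration.

```r
library(ggplot2)

# hypothetical toy data and plot
toy <- data.frame(age = c(20, 40, 60), expenses = c(2000, 6000, 12000))
myplot_toy <- ggplot(toy, aes(x = age, y = expenses)) +
  geom_point()

# write the plot to a temporary PDF file; the path is only an example
out_file <- file.path(tempdir(), "myplot_toy.pdf")
ggsave(out_file, plot = myplot_toy, width = 6, height = 4)
file.exists(out_file)
```

The file extension determines the output format, so a name ending in .png would produce a PNG instead.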

Practice

Read the loan50.csv dataset.

Please refer to the following for more description of the variables in this data set.
https://www.openintro.org/data/index.php?data=loan50

loans = read.csv("loan50.csv")
head(loans)

##   state emp_length term homeownership annual_income verified_income
## 1    NJ          3   60          rent         59000    Not Verified
## 2    CA         10   36          rent         60000    Not Verified
## 3    SC         NA   36      mortgage         75000        Verified
## 4    CA          0   36          rent         75000    Not Verified
## 5    OH          4   60      mortgage        254000    Not Verified
## 6    IN          6   36      mortgage         67000 Source Verified
##   debt_to_income total_credit_limit total_credit_utilized
## 1      0.5575254              95131                 32894
## 2      1.3056833              51929                 78341
## 3      1.0562800             301373                 79221
## 4      0.5743467              59890                 43076
## 5      0.2381496             422619                 60490
## 6      1.0770448             349825                 72162
##   num_cc_carrying_balance       loan_purpose loan_amount grade interest_rate
## 1                       8 debt_consolidation       22000     B         10.90
## 2                       2        credit_card        6000     B          9.92
## 3                      14 debt_consolidation       25000     E         26.30
## 4                      10        credit_card        6000     B          9.92
## 5                       2   home_improvement       25000     B          9.43
## 6                       4   home_improvement        6400     B          9.92
##   public_record_bankrupt loan_status has_second_income total_income
## 1                      0     Current             FALSE        59000
## 2                      1     Current             FALSE        60000
## 3                      0     Current             FALSE        75000
## 4                      0     Current             FALSE        75000
## 5                      0     Current             FALSE       254000
## 6                      0     Current             FALSE        67000

Which of the numeric variables are really categorical variables?

Term and public record of bankrupt.

For ggplot to plot the variables from the previous question as categories, we must convert them to factors. Run the code. What is the variable type of these categorical variables now?

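# convert each variable to a factor explicitly
loans <- mutate(loans,
    term = as.factor(term),
    public_record_bankrupt = as.factor(public_record_bankrupt))

# equivalent alternative: convert both columns in one step with across()
loans <- mutate(loans,
                across( c(term, public_record_bankrupt), as.factor))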

These variables are now factors.
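
The variable type can be confirmed with class and levels. The sketch below uses a short made-up vector of loan terms (term and term_f are hypothetical names, not columns read from loan50.csv).

```r
# hypothetical loan terms in months
term <- c(60, 36, 36, 60)

class(term)               # type before conversion
term_f <- as.factor(term)
class(term_f)             # type after conversion
levels(term_f)            # the distinct categories, stored as text
```

For the real data, class(loans$term) would serve the same purpose.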

Analyze the following using summary and graphical plots.

What is the relationship between annual income and total credit limit?
What is the relationship between total credit limit and interest rate?
What is the relationship between annual income and interest rate?

For each, determine which factors can be used to better understand this relationship (e.g. loan grade, term, homeownership, whether income was verified, etc.).
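
For the summary part of the analysis, a correlation coefficient is one simple numeric summary of a linear relationship between two variables. The sketch below uses made-up values that lie exactly on a line (toy is hypothetical, not loan50.csv), so the correlation is 1; real data would give something weaker.

```r
# hypothetical values, not taken from loan50.csv
toy <- data.frame(annual_income = c(40000, 60000, 80000, 100000),
                  total_credit_limit = c(50000, 90000, 130000, 170000))

# Pearson correlation: values near 1 or -1 indicate a strong linear
# relationship, values near 0 a weak one
r <- cor(toy$annual_income, toy$total_credit_limit)
r
```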

Total Credit Limit vs. Annual Income

p = ggplot( data = loans,
        mapping = aes( x = annual_income,
                       y = total_credit_limit,
                       color = term)) +
  geom_point()

library(scales)
p + scale_x_continuous(labels = unit_format(unit = "k", scale = 1e-3)) +
  scale_y_continuous(labels = unit_format(unit = "k", scale = 1e-3))

# format both axes as dollar amounts
p + scale_x_continuous(labels = scales::dollar) +
  scale_y_continuous(labels = scales::dollar) +
  labs( title = "Total Credit Limit vs Annual Income",
        x = "Annual Income ($)", 
        y = "Total Credit Limit ($)")

Interest Rate vs. Total Credit Limit

p = ggplot( data = loans,
        mapping = aes( x = total_credit_limit,
                       y = interest_rate,
                       color = public_record_bankrupt)) +
  geom_point()

library(scales)
p + scale_x_continuous(labels = scales::dollar) +
  scale_y_continuous(labels = unit_format(unit = "%")) +
  labs( title = "Interest Rate vs Total Credit Limit",
        x = "Total Credit Limit ($)", 
        y = "Interest Rate (%)")

Interest Rate vs. Annual Income

p = ggplot( data = loans,
        mapping = aes( x = annual_income,
                       y = interest_rate,
                       color = homeownership)) +
  geom_point()

library(scales)
p + scale_x_continuous(labels = scales::dollar) +
  scale_y_continuous(labels = unit_format(unit = "%")) +
  labs( title = "Interest Rate vs Annual Income",
        x = "Annual Income ($)", 
        y = "Interest Rate (%)")

Summarize the key findings from your above analysis.

In the first graph, I noticed that people with 36-month and 60-month terms earn about the same amount of money. Their total credit limits are also similar, apart from a few outliers.

In the second graph, I noticed that many more of the people who declared bankruptcy had annual incomes of $50,000-$140,000. There were a few outliers in this graph, but most of the people who went bankrupt fell in the same cluster of annual income.

In the third graph, I noticed that people with mortgages had interest rates ranging from low to high. People who rent also tended to have higher interest rates, although some had low rates. People who own their homes are in the middle of the scatter plot. The most common interest rate appears to be around 11%.

Assignment 4 - Introduction to ggplot2

Emma Fields

2024-09-19