Set Up Your Project and Load Libraries

## Center figures by default and default to printing the R code
knitr::opts_chunk$set(fig.align = "center",
                      echo = TRUE,
                      include = TRUE)

## Load the libraries we will be using
pacman::p_load(tidyverse, skimr)

## Changing the default theme to black/white instead of grey
theme_set(theme_bw())

## Read in the titanic.csv file and save it as t_df.
# Include stringsAsFactors = TRUE in read.csv() to convert all the strings to factors:
t_df <- read.csv("titanic.csv",
                 stringsAsFactors = TRUE)

skim(t_df)
Data summary

Name                t_df
Number of rows      2200
Number of columns   4

Column type frequency:
  factor            4

Group variables     None

Variable type: factor

skim_variable  n_missing  complete_rate  ordered  n_unique  top_counts
class                  0              1  FALSE           4  Cre: 885, Thi: 705, Fir: 325, Sec: 285
age                    0              1  FALSE           2  Adu: 2092, Chi: 108
sex                    0              1  FALSE           2  Mal: 1730, Fem: 470
status                 0              1  FALSE           2  Dea: 1490, Ali: 710

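The titanic data has four variables, all stored as factors:

  • class: First, Second, Third, or Crew
  • age: adult or child
  • sex: male or female
  • status: Alive or Dead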

Reordering the groups of class

You can check out the group order of a factor by using levels(). Let’s look at the group order of class:

levels(t_df$class)
## [1] "Crew"   "First"  "Second" "Third"

They aren’t in the order that we would want to see them in. We can manually specify the order of groups of a factor. How?

By using the factor() function. If you want to change the order of the levels of a factor, you need to give factor() two arguments:

  1. x = the factor (t_df$class)
  2. levels = a vector of the names of the groups in the order we want them presented.
t_df$class <- 
  factor(x = t_df$class,
         levels = c("First", "Second", "Third", "Crew"))

# Checking that the groups are in the new order
levels(t_df$class)
## [1] "First"  "Second" "Third"  "Crew"

Great! But why do we care about the order of the groups? Because when we make our bar chart, the group order will also be the order of the bars. If there isn’t a natural order to the groups, you don’t need to worry about changing the group order. But if there is a natural order, placing the groups in the correct order should be the first step you take!

Categorical Data: Bar Graphs with geom_bar()

Start with a bar graph of passenger class. The most direct way to create a bar chart with unsummarized, “raw” data is to add geom_bar() to the blank plot created by ggplot().

To create a basic bar chart with geom_bar(), you need to map either the x or the y aesthetic (depending on whether you want the group names on the x-axis or the y-axis) and leave the other aesthetic unassigned.

If you want to change the color of the bars created by geom_bar() (or any other bars created by a geom), you should use fill, not color.

First create a bar chart using geom_bar(). Once you’ve got the code to run properly, SET the fill aesthetic to “steelblue” and color to “black”.

Then, once you have that working, change the label on the x-axis to say “Passenger Class” and add a title of “Titanic Passengers”

Remember, setting an aesthetic means placing it outside of aes().

While it is not generally recommended, you can change the color of each bar by mapping fill to class as well. If you do, you should remove the legend ggplot() automatically creates. This can be done by including show.legend = F in geom_bar()
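The difference between setting and mapping fill can be sketched as below. This is only an illustration, not the solution posted in Brightspace: class_df, gg_set, and gg_mapped are hypothetical names, and class_df rebuilds just the class column from the counts reported by skim() so the example runs without titanic.csv.

```r
library(ggplot2)

# Hypothetical stand-in for t_df: only the class column, rebuilt from the
# counts shown in the skim() summary (First 325, Second 285, Third 705, Crew 885)
class_df <- data.frame(
  class = factor(rep(c("First", "Second", "Third", "Crew"),
                     times = c(325, 285, 705, 885)),
                 levels = c("First", "Second", "Third", "Crew"))
)

# fill and color are SET (outside aes()), so every bar looks the same
gg_set <- ggplot(class_df, aes(x = class)) +
  geom_bar(fill = "steelblue", color = "black")

# fill is MAPPED (inside aes()), so each class gets its own color;
# show.legend = FALSE drops the redundant legend
gg_mapped <- ggplot(class_df, aes(x = class, fill = class)) +
  geom_bar(show.legend = FALSE)

gg_set
gg_mapped
```

With the real data, t_df would take the place of class_df.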

You can also manually pick the color of the bars by giving the fill argument a vector with one color per bar. Redo the graph above, but set fill = c("tomato", "steelblue", "seagreen", "orchid")

Once you’ve got it to match what is in Brightspace, move on to the next step: replacing the y-axis counts with proportions!

Creating a bar chart using summarized data

When only working with categorical data, it’s common to work with a table of counts instead of the “raw”, unsummarized data. For instance, we can count the number of passengers in each class using the count() function:

##    class   n
## 1  First 325
## 2 Second 285
## 3  Third 705
## 4   Crew 885
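A call along the following lines would produce a table like the one above. It is a sketch under assumptions: class_df is a hypothetical stand-in for t_df, rebuilt from the skim() counts so the example runs without titanic.csv, and the result is saved under the name passenger_count used in the text.

```r
library(dplyr)

# Hypothetical stand-in for t_df containing only the class column
class_df <- data.frame(
  class = factor(rep(c("First", "Second", "Third", "Crew"),
                     times = c(325, 285, 705, 885)),
                 levels = c("First", "Second", "Third", "Crew"))
)

# count() returns one row per group, with the number of rows in a column named n
passenger_count <- count(class_df, class)

passenger_count
```

The rows follow the factor's level order, which is why reordering class earlier also reorders this table.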

What happens if we try to create a bar chart with the summarized data set passenger_count? Let’s try using geom_bar() and mapping the counts, n, to the y aesthetic:

You should see an error. The line after the second ! in the error message says that stat_count() must only have an x or y aesthetic. The message means that if you are using geom_bar(), you can’t specify both x and y, because geom_bar() calculates the counts for the other aesthetic itself. So what do we do?

If you’re working with summarized data and need to specify both x and y (like our passenger_count data), replace geom_bar() with geom_col(). Copy and paste the code above into the code chunk below, then make the appropriate change!
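As a rough illustration of that swap, the sketch below types the counts table in by hand (so it runs without titanic.csv) and hands both x and y to geom_col(). The name gg_col is hypothetical.

```r
library(ggplot2)

# The summarized counts from the table above, entered by hand
passenger_count <- data.frame(
  class = factor(c("First", "Second", "Third", "Crew"),
                 levels = c("First", "Second", "Third", "Crew")),
  n = c(325, 285, 705, 885)
)

# geom_col() uses the y values as the bar heights instead of counting rows
gg_col <- ggplot(passenger_count, aes(x = class, y = n)) +
  geom_col()

gg_col
```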

Two categorical variables

Now we will be looking at the association between passenger class and passenger status.

We want to know, were passengers more or less likely to survive the sinking of the Titanic depending on their passenger class?

Two Variable Bar Graphs with counts - Stacked and Side-by-Side

Anytime you are working with two variables, it is important to determine the role the variables play in the data - explanatory or response.

If we want to know if a passenger was more or less likely to survive based on the type of passenger, which of the two variables is the explanatory and which is the response?

  • Explanatory:

  • Response:

When creating a bar chart for two variables, you want to use the x (or y) and fill aesthetics, and the role each variable plays determines which aesthetic it will be assigned to:

  • x = explanatory

  • fill = response

These can be mapped in either ggplot() or geom_bar()

Start by creating a blank plot by mapping the variables in ggplot() and using labs() to change the x and y-axis labels and add a title to match what is in Brightspace. Save the result as gg_titanic.

Once you’ve created your blank plot, add geom_bar() to it (but don’t save the result, at least not yet…)

geom_bar() defaults to placing the bars for the fill variable stacked on top of one another. If we keep the default choice, this is called a segmented bar chart (which is better when comparing groups of one variable across another variable).

But how can we change it to be a side-by-side bar chart (what is typically seen)?

position argument

Each geom has a position = argument, which specifies how to position objects that have the same x or y value.

For instance, the bars for Alive/First and Dead/First have the same x-value: First. ggplot() needs a way to position them in the graph, and geom_bar() defaults to “stack”.

If we want the bars to be placed next to each other, we need to specify that position = "dodge" (so they are near but don’t overlap)

The common choices for position are:

  • stack: place them on top of one another
  • dodge: place them next to each other
  • identity: place them exactly at x and y
  • jitter: randomly move them a small amount (mainly used with points in a scatterplot)

If you want to place them next to each other, try position = "dodge" or position = "dodge2"
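The effect of position can be seen on a small example. The data below are made up purely for illustration (they are NOT the Titanic counts), and toy_df, gg_toy, gg_stack, and gg_dodge are hypothetical names.

```r
library(ggplot2)

# Made-up toy data: group A has 4 Yes and 2 No, group B has 1 Yes and 3 No
toy_df <- data.frame(
  group   = rep(c("A", "B"), times = c(6, 4)),
  outcome = rep(c("Yes", "No", "Yes", "No"), times = c(4, 2, 1, 3))
)

# Blank plot: x holds the explanatory variable, fill holds the response
gg_toy <- ggplot(toy_df, aes(x = group, fill = outcome))

# Default: bars sharing an x value are stacked on top of one another
gg_stack <- gg_toy + geom_bar()

# position = "dodge": bars sharing an x value are placed side by side
gg_dodge <- gg_toy + geom_bar(position = "dodge")

gg_stack
gg_dodge
```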

While it is never recommended for bar charts, try changing position to “identity” and see how it looks! It will help to include alpha = 0.5 in geom_bar() to figure out what is happening!

Displaying Conditional Proportions with geom_bar()

geom_bar() defaults to summarizing the data with counts, which can make it difficult to compare whether the survival and death rates are the same across the 4 different passenger classes, since Third and Crew are so much larger than First and Second.

Instead, it’s best to show the survival rates by changing the y-axis to the conditional proportion - the percentage of survivors within each passenger class (and the same with the percentage that died within each passenger class)

What do we do if we want to represent the proportion of passengers that lived or died within each passenger class?

Thankfully, it is a lot easier and more intuitive than displaying a proportion for a single variable. If you want to display the conditional proportions within each group, you just need to include position = "fill" in geom_bar().
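A minimal sketch of position = "fill", again on made-up toy data rather than the Titanic data (toy_df and gg_fill are hypothetical names):

```r
library(ggplot2)

# Made-up toy data: group A has 4 Yes and 2 No, group B has 1 Yes and 3 No
toy_df <- data.frame(
  group   = rep(c("A", "B"), times = c(6, 4)),
  outcome = rep(c("Yes", "No", "Yes", "No"), times = c(4, 2, 1, 3))
)

# position = "fill" rescales each stacked bar to a total height of 1,
# so the segments show the proportion of each outcome within each group
gg_fill <- ggplot(toy_df, aes(x = group, fill = outcome)) +
  geom_bar(position = "fill")

gg_fill
```

Because every bar now has the same height, the groups can be compared directly even though A has more rows than B.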

Once you’ve created the bar chart, save the result as gg_class_status

Now we get a much clearer picture. If passenger class and status were independent, the dividing line within each bar would be at about the same height. Since the red area for First class is much larger than it is for the other 3 passenger classes, the graph shows that First class passengers had a much higher chance of surviving!

The same holds true when comparing Second class with Third and Crew, just not to the same extent as for First class passengers.

Like we did with the summarized data set for passenger class earlier, we can create the same bar charts as seen above using geom_col() instead of geom_bar(). Under the code creating the passenger_status data set below, create a stacked bar chart using the passenger_status data set:
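The general pattern for a two-variable summary can be sketched on made-up toy data (NOT the Titanic counts): count() takes both variables, and geom_col() stacks the resulting counts by default. The names toy_df, toy_count, and gg_toy_col are hypothetical; with the real data, t_df, class, and status would presumably take their places.

```r
library(dplyr)
library(ggplot2)

# Made-up toy data: group A has 4 Yes and 2 No, group B has 1 Yes and 3 No
toy_df <- data.frame(
  group   = rep(c("A", "B"), times = c(6, 4)),
  outcome = rep(c("Yes", "No", "Yes", "No"), times = c(4, 2, 1, 3))
)

# One row per group/outcome combination, with the count in column n
toy_count <- count(toy_df, group, outcome)

# geom_col() needs both x and y; its default position is "stack"
gg_toy_col <- ggplot(toy_count, aes(x = group, y = n, fill = outcome)) +
  geom_col()

gg_toy_col
```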

Now, using what you’ve learned earlier, create the stacked bar chart with conditional percentages on the y-axis instead of the counts!

Side-by-Side Bar Charts for conditional proportions using geom_bar()

geom_bar() will default to stacked bar charts, regardless of whether it is calculating counts or proportions. How do we create a side-by-side bar chart with conditional proportions using geom_bar()?

Why can’t we create a side-by-side bar chart using geom_bar() displaying the conditional proportions?

To create a side-by-side bar chart, you specify position = "dodge"

To display the conditional proportions, you specify position = "fill".

And each function in R can only use the same argument once. We can’t have two different position = arguments in the same geom!

While we can’t do it using geom_bar() and the “raw” data, it is possible to create such a graph using ggplot()!

We’ll see later how we can do it ourselves, but it involves calculating the conditional proportions using R, and we haven’t gotten to that point yet!
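As a preview, one possible approach (not necessarily the one used later in the course) is to compute the conditional proportions with dplyr first and then hand them to geom_col(position = "dodge"). The sketch uses made-up toy data, and toy_df, toy_prop, prop, and gg_prop_dodge are hypothetical names.

```r
library(dplyr)
library(ggplot2)

# Made-up toy data: group A has 4 Yes and 2 No, group B has 1 Yes and 3 No
toy_df <- data.frame(
  group   = rep(c("A", "B"), times = c(6, 4)),
  outcome = rep(c("Yes", "No", "Yes", "No"), times = c(4, 2, 1, 3))
)

# Count each group/outcome combination, then divide by the group total
# to get the proportion of each outcome within its group
toy_prop <- toy_df %>%
  count(group, outcome) %>%
  group_by(group) %>%
  mutate(prop = n / sum(n)) %>%
  ungroup()

# The proportions are already computed, so position is free to be "dodge"
gg_prop_dodge <- ggplot(toy_prop, aes(x = group, y = prop, fill = outcome)) +
  geom_col(position = "dodge")

gg_prop_dodge
```

Since the proportions live in the data rather than being produced by position = "fill", the single position argument can be spent on placing the bars side by side.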

Making our plot look better

Let’s improve the look of our conditional proportion bar chart by doing the following:

  1. Remove the label for fill since Alive and Dead don’t need added context

  2. Change Alive to blue and Dead to orange (blue generally indicates good and orange/red indicates bad)

  3. Change the y-axis to be percentages

  4. Remove the added space at the bottom of the graph

gg_class_status + 
  
  # Removing the fill label with labs() and NULL
  labs(fill = NULL) + 

  # Changing the colors used for fill with scale_fill_manual() and values 
  scale_fill_manual(values = c("Alive" = "steelblue",
                               "Dead" = "tomato")) + 
  
  # Changing the labels on the y-axis using scale_y_continuous() and labels
  scale_y_continuous(labels = scales::percent,
                     # expand controls how much space is added at the bottom and top of the graph
                     expand = c(0, 0,        # The first 2 numbers are the multiplier and the amount added at the bottom of the graph
                                0, 0.05))    # The second 2 numbers are the multiplier and the amount added at the top of the graph