## Center figures by default and default to printing the R code
knitr::opts_chunk$set(fig.align = "center",
                      echo = T,
                      include = T)
## Load the libraries we will be using
pacman::p_load(tidyverse, skimr)
## Changing the default theme to black/white instead of grey
theme_set(theme_bw())
## Read in the titanic.csv file and save it as t_df.
# Include stringsAsFactors = T in read.csv() to change all the strings to factors:
t_df <- read.csv("titanic.csv",
stringsAsFactors = T)
skim(t_df)
Name | t_df |
Number of rows | 2200 |
Number of columns | 4 |
_______________________ | |
Column type frequency: | |
factor | 4 |
________________________ | |
Group variables | None |
Variable type: factor
skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
---|---|---|---|---|---|
class | 0 | 1 | FALSE | 4 | Cre: 885, Thi: 705, Fir: 325, Sec: 285 |
age | 0 | 1 | FALSE | 2 | Adu: 2092, Chi: 108 |
sex | 0 | 1 | FALSE | 2 | Mal: 1730, Fem: 470 |
status | 0 | 1 | FALSE | 2 | Dea: 1490, Ali: 710 |
The titanic data has four variables, all stored as factors: class, age, sex, and status.
You can check the group order of a factor by using `levels()`. Let’s look at the group order of class:
levels(t_df$class)
## [1] "Crew" "First" "Second" "Third"
They aren’t in the order that we would want to see them in. We can manually specify the order of groups of a factor. How?
By using the `factor()` function. If you want to change the levels of a factor, you need to give `factor()` two arguments:

- `x =` the factor (`t_df$class`)
- `levels =` a vector of the names of the groups in the order we want them presented

The group names must be spelled exactly as they appear in the output of the `levels()` function we used earlier, otherwise the rows with misspelled groups will be converted to NAs.

t_df$class <-
  factor(x = t_df$class,
         levels = c("First", "Second", "Third", "Crew"))
# Checking that the groups are in the new order
levels(t_df$class)
## [1] "First" "Second" "Third" "Crew"
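To see why the spelling warning matters, here is a minimal sketch on a made-up factor. The `sizes` vector and its group names are hypothetical and are not part of the titanic data.

```r
# Hypothetical toy factor (not the titanic data)
sizes <- factor(c("small", "large", "medium", "small"))
levels(sizes)  # levels default to alphabetical order

# Correctly spelled levels: only the group order changes
sizes_ordered <- factor(x = sizes, levels = c("small", "medium", "large"))
levels(sizes_ordered)

# A misspelled level ("medum") turns every "medium" row into NA without any warning
sizes_typo <- factor(x = sizes, levels = c("small", "medum", "large"))
sum(is.na(sizes_typo))
```

Counting the NAs right after reordering, as in the last line, is a cheap way to catch a typo before it quietly drops rows from a graph.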
Great! But why do we care about the order of the groups? Because when we make our bar chart, the group order will also be the order of the bars. If there isn’t a natural order to the groups, you don’t need to worry about changing the group order. But if there is a natural order, placing the groups in the correct order should be the first step you take!
Start with a bar graph of passenger class. The most direct way to create a bar chart with unsummarized, “raw” data is to add `geom_bar()` to the blank plot created by `ggplot()`.
To create a basic bar chart with `geom_bar()`, you need to use either the `x` or `y` aesthetic (depending on whether you want the group names on the x- or y-axis) and not assign the other aesthetic to a column.
If you want to change the color of the bars created by `geom_bar()` (or any other bars created by a geom), you should use `fill`, not `color`. The `color` aesthetic only changes the outline of the bars.
First create a bar chart using `geom_bar()`. Once you’ve got the code to run properly, SET the `fill` aesthetic to “steelblue” and `color` to “black”.
Then, once you have that working, change the label on the x-axis to say “Passenger Class” and add a title of “Titanic Passengers”.
Remember, setting an aesthetic means you do not include it inside `aes()`.
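One way these steps could fit together is sketched below. Since titanic.csv isn’t available here, the sketch rebuilds a one-column stand-in (`t_demo`, a made-up name) from the class counts printed by `skim()`. Treat it as an illustration, not the expected answer.

```r
library(ggplot2)

# Stand-in for t_df: one row per person, rebuilt from the class counts in the skim() output
class_levels <- c("First", "Second", "Third", "Crew")
t_demo <- data.frame(
  class = factor(rep(class_levels, times = c(325, 285, 705, 885)),
                 levels = class_levels)
)

gg_class <- ggplot(data = t_demo, mapping = aes(x = class)) +
  # fill and color are SET (outside aes()), so every bar gets the same look
  geom_bar(fill = "steelblue", color = "black") +
  labs(x = "Passenger Class", title = "Titanic Passengers")

gg_class
```

Only `x` is mapped inside `aes()`, so `geom_bar()` does the counting for the y-axis.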
While it is not generally recommended, you can change the color of each bar by mapping `fill` to class as well. If you do, you should remove the legend `ggplot()` automatically creates. This can be done by including `show.legend = F` in `geom_bar()`.
You can also manually pick the color of the bars by giving the `fill` argument a vector of colors equal in length to the number of bars. Redo the graph above, but set `fill = c("tomato", "steelblue", "seagreen", "orchid")`.
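The two coloring options can be compared side by side in a short sketch. As before, `t_demo` is a made-up stand-in for `t_df` rebuilt from the class counts in the `skim()` output.

```r
library(ggplot2)

# Stand-in for t_df, rebuilt from the class counts in the skim() output
class_levels <- c("First", "Second", "Third", "Crew")
t_demo <- data.frame(
  class = factor(rep(class_levels, times = c(325, 285, 705, 885)),
                 levels = class_levels)
)

# Option 1: MAP fill to class (inside aes()) and drop the redundant legend
gg_mapped <- ggplot(t_demo, aes(x = class, fill = class)) +
  geom_bar(color = "black", show.legend = FALSE)

# Option 2: SET fill to one color per bar (outside aes())
gg_manual <- ggplot(t_demo, aes(x = class)) +
  geom_bar(fill = c("tomato", "steelblue", "seagreen", "orchid"),
           color = "black")

gg_mapped
gg_manual
```

In option 2 the vector has to contain exactly one color per bar (or a single color), otherwise ggplot2 stops with an error about the aesthetic’s length.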
Once you’ve got it to match what is in Brightspace, move on to the next step: replacing the y-axis counts with proportions!
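The handout’s own code for the proportion step isn’t shown here. One common way to do it is to map `y` to `after_stat(prop)` with `group = 1`; this is offered as an assumption about the intended approach, not the official solution. `t_demo` is again a made-up stand-in built from the class counts in the `skim()` output.

```r
library(ggplot2)

# Stand-in for t_df, rebuilt from the class counts in the skim() output
class_levels <- c("First", "Second", "Third", "Crew")
t_demo <- data.frame(
  class = factor(rep(class_levels, times = c(325, 285, 705, 885)),
                 levels = class_levels)
)

gg_prop <- ggplot(t_demo, aes(x = class)) +
  # after_stat(prop) is the proportion computed by geom_bar()'s counting step;
  # group = 1 treats all bars as one group, so the proportions sum to 1 overall
  geom_bar(aes(y = after_stat(prop), group = 1),
           fill = "steelblue", color = "black") +
  labs(x = "Passenger Class", y = "Proportion", title = "Titanic Passengers")

gg_prop
```

Without `group = 1`, each bar would be its own group and every bar would have a proportion of 1.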
When working only with categorical data, it’s common to work with a table of counts instead of the “raw”, unsummarized data. For instance, we can count the number of passengers in each class using the `count()` function:
## class n
## 1 First 325
## 2 Second 285
## 3 Third 705
## 4 Crew 885
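A table like the one above could be produced along these lines. The sketch assumes dplyr’s `count()` (loaded with the tidyverse) and uses a made-up stand-in, `t_demo`, rebuilt from the class counts in the `skim()` output.

```r
library(dplyr)

# Stand-in for t_df, rebuilt from the class counts in the skim() output
class_levels <- c("First", "Second", "Third", "Crew")
t_demo <- data.frame(
  class = factor(rep(class_levels, times = c(325, 285, 705, 885)),
                 levels = class_levels)
)

# One row per group of class, with the number of rows stored in a column named n
passenger_count <- count(t_demo, class)
passenger_count
```

The rows come out in the factor’s level order, which is another reason to fix the group order before summarizing.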
What happens if we try to create a bar chart with the summarized data set passenger_count? Let’s try using `geom_bar()` and mapping the counts, n, to the `y` aesthetic:
You should see an error next to the second `!` in the error message that says `stat_count()` must only have an x or y aesthetic. The message means that if you are using `geom_bar()`, you can’t specify both `x` and `y`, as `geom_bar()` will calculate the counts for the other aesthetic. So what do we do?
If you’re working with summarized data and you need to specify both `x` and `y` (like our passenger_count data), we replace `geom_bar()` with `geom_col()`. Copy and paste the code above into the code chunk below, then make the appropriate change!
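The difference between the two geoms on summarized data can be sketched as follows. The `passenger_count` table is typed in by hand from the counts shown earlier, and `tryCatch()` is used only so the script keeps running after the error.

```r
library(ggplot2)

# Summarized data, typed in from the count table shown earlier
class_levels <- c("First", "Second", "Third", "Crew")
passenger_count <- data.frame(
  class = factor(class_levels, levels = class_levels),
  n     = c(325, 285, 705, 885)
)

# geom_bar() wants to count rows itself, so giving it both x and y fails when the plot is built
gg_bar <- ggplot(passenger_count, aes(x = class, y = n)) +
  geom_bar()
bar_result <- tryCatch(ggplot_build(gg_bar), error = function(e) e)
conditionMessage(bar_result)

# geom_col() uses the y values as given
gg_col <- ggplot(passenger_count, aes(x = class, y = n)) +
  geom_col(fill = "steelblue", color = "black")
gg_col
```

Note that the error only appears when the plot is built or printed, not when `gg_bar` is created.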
Now we will be looking at the association between passenger class and passenger status.
We want to know, were passengers more or less likely to survive the sinking of the Titanic depending on their passenger class?
Anytime you are working with two variables, it is important to determine the role the variables play in the data - explanatory or response.
If we want to know if a passenger was more or less likely to survive based on the type of passenger, which of the two variables is the explanatory and which is the response?
Explanatory:
Response:
When creating a bar chart for two variables, you want to use the `x` (or `y`) and `fill` aesthetics, and the role each variable plays determines which aesthetic it will be assigned to:

- `x =` explanatory
- `fill =` response

These can be mapped in either `ggplot()` or `geom_bar()`.
Start by creating a blank plot by mapping the variables in `ggplot()` and using `labs()` to change the x- and y-axis labels and add a title to match what is in Brightspace. Save the result as `gg_titanic`.
Once you’ve created your blank plot, add `geom_bar()` to it (but don’t save the result, at least not yet…)
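Here is a rough sketch of the blank-plot-then-bars workflow. Because titanic.csv isn’t available here, it uses R’s built-in `Titanic` table as a stand-in. That table has 2201 people rather than 2200, and its columns are named `Class` and `Survived` (No/Yes) instead of class and status. The labels are placeholders, not the ones in Brightspace.

```r
library(ggplot2)

# Stand-in for t_df: expand the built-in Titanic table to one row per person
titanic_rows <- as.data.frame(Titanic)
t_demo <- titanic_rows[rep(seq_len(nrow(titanic_rows)), times = titanic_rows$Freq),
                       c("Class", "Survived")]

# Blank plot: explanatory variable on x, response variable as fill
gg_demo <- ggplot(data = t_demo, mapping = aes(x = Class, fill = Survived)) +
  labs(x = "Passenger Class", y = "Passengers", title = "Titanic Survival by Class")

# Adding geom_bar() draws one bar per Class, split by Survived
gg_stacked <- gg_demo + geom_bar()
gg_stacked
```

Saving the blank plot first means each later variation only needs a different `geom_bar()` call.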
`geom_bar()` defaults to placing the bars for the `fill` variable stacked on top of one another. If we keep the default choice, this is called a segmented bar chart (which is better when comparing groups of one variable across another variable).
But how can we change it to be a side-by-side bar chart (what is typically seen)?
Each geom has a `position =` argument, which specifies how to position objects that have the same x or y value.
For instance, the bars for Alive/First and Dead/First have the same x-value: First. `ggplot()` needs a way to position them in the graph, and `geom_bar()` defaults to “stack”.
If we want the bars to be placed next to each other, we need to specify `position = "dodge"` (so they are near each other but don’t overlap).
The common choices for `position` are “stack” (the default for `geom_bar()`), “dodge” or “dodge2”, “fill”, and “identity”.
If you want to place the bars next to each other, try `position = "dodge"` or `position = "dodge2"`.
While it is never recommended for bar charts, try changing `position` to “identity” and see how it looks! It will help to include `alpha = 0.5` in `geom_bar()` to figure out what is happening!
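The position choices can be tried out with a sketch like the one below. It again uses R’s built-in `Titanic` table as a stand-in for titanic.csv, so the columns are `Class` and `Survived` rather than class and status.

```r
library(ggplot2)

# Stand-in for t_df: expand the built-in Titanic table to one row per person
titanic_rows <- as.data.frame(Titanic)
t_demo <- titanic_rows[rep(seq_len(nrow(titanic_rows)), times = titanic_rows$Freq),
                       c("Class", "Survived")]

gg_base <- ggplot(t_demo, aes(x = Class, fill = Survived))

# Side-by-side bars; dodge2 also leaves a small gap between neighboring bars
gg_dodge  <- gg_base + geom_bar(position = "dodge")
gg_dodge2 <- gg_base + geom_bar(position = "dodge2")

# "identity" draws both bars from zero at the same x position, so they overlap;
# alpha = 0.5 makes them see-through so the hidden bar becomes visible
gg_identity <- gg_base + geom_bar(position = "identity", alpha = 0.5)

gg_dodge
gg_dodge2
gg_identity
```

With “identity” every bar starts at zero, so the shorter bar sits on top of or behind the taller one instead of on top of its total, which is why the chart is easy to misread.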
`geom_bar()` defaults to summarizing the data with counts, which can make it difficult to compare whether the survival and death rates are the same across the 4 different passenger classes, since Third and Crew are so much larger than First and Second.
Instead, it’s best to show the survival rates by changing the y-axis to the conditional proportion: the percentage of survivors within each passenger class (and likewise the percentage that died within each passenger class).
What do we do if we want to represent the proportion of passengers that lived or died within each passenger class?
Thankfully, it is a lot easier and more intuitive than displaying a proportion for a single variable. If you want to display the conditional proportions within each group, you just need to include `position = "fill"` in `geom_bar()`.
Once you’ve created the bar chart, save the result as `gg_class_status`.
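A minimal sketch of the conditional-proportion chart follows. It uses R’s built-in `Titanic` table as a stand-in for titanic.csv, and the object is named `gg_fill_demo` to keep it separate from the handout’s `gg_class_status`.

```r
library(ggplot2)

# Stand-in for t_df: expand the built-in Titanic table to one row per person
titanic_rows <- as.data.frame(Titanic)
t_demo <- titanic_rows[rep(seq_len(nrow(titanic_rows)), times = titanic_rows$Freq),
                       c("Class", "Survived")]

gg_fill_demo <- ggplot(t_demo, aes(x = Class, fill = Survived)) +
  # position = "fill" stacks the groups and rescales every bar to a total height of 1
  geom_bar(position = "fill") +
  labs(x = "Passenger Class", y = "Proportion", title = "Titanic Survival by Class")

gg_fill_demo
```

Because every bar now has the same height, differences in where the colors divide reflect differences in rates, not in group sizes.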
Now we get a much clearer picture. If passenger class and status were independent, the dividing line within each bar would be at about the same height. Since the red area for First class is much larger than for the other 3 passenger classes, it shows that First class passengers had a much higher chance of surviving!
The same holds true when comparing Second class with Third and Crew, just not to the same extent as for First class passengers.
Like we did with the summarized data set for passenger class earlier, we can create the same bar charts as seen above using `geom_col()` instead of `geom_bar()`. Under the code creating the passenger_status data set below, create a stacked bar chart using the passenger_status data set:
Now, using what you’ve learned earlier, create the stacked bar chart with conditional percentages on the y-axis instead of the counts!
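Both charts can be sketched from a summarized table. The code that creates passenger_status isn’t shown here, so this sketch builds a comparable table (`passenger_status_demo`, a made-up name) from R’s built-in `Titanic` data, with columns `Class`, `Survived`, and `n`.

```r
library(dplyr)
library(ggplot2)

# Stand-in for passenger_status: one row per Class/Survived combination with its count
passenger_status_demo <- as.data.frame(Titanic) %>%
  count(Class, Survived, wt = Freq)
passenger_status_demo

# Stacked bar chart of counts: geom_col() needs y mapped to the count column
gg_col_stack <- ggplot(passenger_status_demo, aes(x = Class, y = n, fill = Survived)) +
  geom_col()

# Same chart with conditional proportions on the y-axis
gg_col_fill <- ggplot(passenger_status_demo, aes(x = Class, y = n, fill = Survived)) +
  geom_col(position = "fill")

gg_col_stack
gg_col_fill
```

The `position` argument behaves the same way in `geom_col()` as in `geom_bar()`; only the source of the bar heights differs.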
`geom_bar()` will default to stacked bar charts, regardless of whether it is calculating counts or proportions. How do we create a side-by-side bar chart with conditional proportions using `geom_bar()`?
Why can’t we create a side-by-side bar chart using `geom_bar()` displaying the conditional proportions?
To create a side-by-side bar chart, you specify `position = "dodge"`.
To display the conditional proportions, you specify `position = "fill"`.
And each function in R can only use the same argument once. We can’t have two different `position =` arguments in the same geom!
While it is possible to create such a graph using `ggplot()`, we can’t do it using `geom_bar()` and the “raw” data. But it can be done!
We’ll see later how we can do it ourselves, but it involves calculating the conditional proportions using R, and we haven’t gotten to that point yet!
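For anyone who wants a preview, one possible approach is sketched here. It is an assumption about how this might be done with dplyr, not necessarily the method covered later. It uses R’s built-in `Titanic` table as a stand-in, so the columns are `Class` and `Survived`.

```r
library(dplyr)
library(ggplot2)

# Count each Class/Survived combination, then divide by the total within each Class
class_status_prop <- as.data.frame(Titanic) %>%
  count(Class, Survived, wt = Freq) %>%
  group_by(Class) %>%
  mutate(prop = n / sum(n)) %>%
  ungroup()

# The proportions are already calculated, so geom_col() is free to use position = "dodge"
gg_dodge_prop <- ggplot(class_status_prop, aes(x = Class, y = prop, fill = Survived)) +
  geom_col(position = "dodge") +
  labs(x = "Passenger Class", y = "Proportion within class")

gg_dodge_prop
```

Calculating the proportions ourselves frees up `position` for “dodge”, which is exactly what `geom_bar()` on the raw data could not offer.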
Let’s improve the look of our conditional proportion bar chart by doing the following:

- Remove the label for `fill`, since Alive and Dead don’t need added context
- Change Alive to blue and Dead to red-orange (blue generally indicates good and orange/red indicates bad)
- Change the y-axis to be percentages
- Remove the added space at the bottom of the graph

gg_class_status +
  # Removing the fill label with labs() and NULL
  labs(fill = NULL) +
  # Changing the colors used for fill with scale_fill_manual() and values
  scale_fill_manual(values = c("Alive" = "steelblue",
                               "Dead" = "tomato")) +
  # Changing the labels on the y-axis using scale_y_continuous() and labels
  scale_y_continuous(labels = scales::percent,
                     # expand controls how much space is added at the bottom and top of the graph
                     expand = c(0, 0,     # The first 2 numbers are what is multiplied and added to the bottom of the graph
                                0, 0.05)) # The second 2 numbers are what is multiplied and added to the top of the graph
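The four-number `expand` vector is easy to misread, so here is a sketch of the same polish using ggplot2’s `expansion()` helper, which names the multiplied and added parts explicitly. The plot is rebuilt from R’s built-in `Titanic` table as a stand-in, so the fill groups are “Yes”/“No” rather than “Alive”/“Dead”.

```r
library(ggplot2)

# Stand-in for gg_class_status, rebuilt from the built-in Titanic table
titanic_rows <- as.data.frame(Titanic)
t_demo <- titanic_rows[rep(seq_len(nrow(titanic_rows)), times = titanic_rows$Freq),
                       c("Class", "Survived")]
gg_fill_demo <- ggplot(t_demo, aes(x = Class, fill = Survived)) +
  geom_bar(position = "fill")

# expansion(mult, add) gives c(bottom, top) for each part:
# nothing at the bottom, 0.05 added at the top
y_expand <- expansion(mult = c(0, 0), add = c(0, 0.05))
y_expand

gg_polished <- gg_fill_demo +
  labs(fill = NULL) +
  scale_fill_manual(values = c("Yes" = "steelblue", "No" = "tomato")) +
  scale_y_continuous(labels = scales::percent, expand = y_expand)

gg_polished
```

Printing `y_expand` shows that `expansion()` produces the same kind of four-number vector used in the handout’s code, just with the roles of the numbers spelled out.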