## Center the figures and hide the R code, warnings, and messages by default
knitr::opts_chunk$set(fig.align = "center",
                      echo = F,
                      warning = F,
                      message = F,
                      include = T)
## Load the libraries we will be using
pacman::p_load(tidyverse, skimr)
## Changing the default theme to black/white instead of grey
theme_set(theme_bw())
## Read in the titanic.csv file and save it as t_df.
# Include stringsAsFactors = T in read.csv() to change all the strings to factors:
t_df <-
  read.csv("https://raw.githubusercontent.com/Shammalamala/DS-2870-Data-Sets/main/titanic.csv",
           stringsAsFactors = T)
Skimming the data set
Name | t_df |
Number of rows | 2200 |
Number of columns | 4 |
_______________________ | |
Column type frequency: | |
factor | 4 |
________________________ | |
Group variables | None |
Variable type: factor
skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
---|---|---|---|---|---|
class | 0 | 1 | FALSE | 4 | Cre: 885, Thi: 705, Fir: 325, Sec: 285 |
age | 0 | 1 | FALSE | 2 | Adu: 2092, Chi: 108 |
sex | 0 | 1 | FALSE | 2 | Mal: 1730, Fem: 470 |
status | 0 | 1 | FALSE | 2 | Dea: 1490, Ali: 710 |
The titanic data has four variables, all of them factors: class, age, sex, and status.
You can check the group order of a factor by using levels(). Let’s look at the group order of class:
## [1] "Crew" "First" "Second" "Third"
They aren’t in the order that we would want to see them in. We can manually specify the order of the groups of a factor by using the factor() function. If you want to change the levels of a factor, you need to give factor() two arguments:
- x = the factor (t_df$class)
- levels = a vector of the names of the groups in the order we want them presented
Make sure the group names are spelled exactly as they appear in the output of the levels() function we used earlier, otherwise the rows with misspelled groups will be converted to NAs.
t_df$class <-
  factor(x = t_df$class,
         levels = c("First", "Second", "Third", "Crew"))
# Checking that the groups are in the new order
levels(t_df$class)
## [1] "First" "Second" "Third" "Crew"
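The effect of the levels argument, including what happens to a misspelled group name, can be seen in a small sketch. The vector class_chr and its values are made up for illustration and are not the course data.

```r
# Toy character vector (hypothetical values, not the course data)
class_chr <- c("Crew", "First", "Third", "Second", "Crew")

# Without levels =, factor() sorts the groups alphabetically
default_order <- factor(x = class_chr)
levels(default_order)

# Supplying levels = puts the groups in the order we choose
reordered <- factor(x = class_chr,
                    levels = c("First", "Second", "Third", "Crew"))
levels(reordered)

# A misspelled level ("Secnd") matches nothing, so the "Second" row becomes NA
misspelled <- factor(x = class_chr,
                     levels = c("First", "Secnd", "Third", "Crew"))
sum(is.na(misspelled))
```

Counting the NAs right after reordering is a quick way to catch a typo in the levels vector.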
Great! But why do we care about the order of the groups? Because when we make our bar chart, the group order will also be the order of the bars. If there isn’t a natural order to the groups, you don’t need to worry about changing the group order. But if there is a natural order, placing the groups in the correct order should be the first step you take!
Extra:
If the order you want the groups/levels to be in is the same order in which the groups first appear in the data itself, as it is with our data (First/Second/Third/Crew), you can use as_factor() from the forcats package (part of the tidyverse) to set the order of the levels a little more simply.
Note: This only works if the column is a character, not already a factor, like class already is in this example :(
## [1] "First" "Second" "Third" "Crew"
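A minimal sketch of the difference, assuming the forcats package is installed. The vector class_chr is a made-up stand-in for a character column, not the course data.

```r
library(forcats)

# Toy character vector (hypothetical); groups first appear as First, Second, Third, Crew
class_chr <- c("First", "Second", "Third", "Crew", "Second", "First")

# On a character vector, as_factor() orders the levels by first appearance
by_appearance <- as_factor(class_chr)
levels(by_appearance)

# On something that is already a factor, as_factor() leaves the levels as they are
already_factor <- factor(class_chr)   # alphabetical levels
unchanged <- as_factor(already_factor)
levels(unchanged)
```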
geom_bar()
Start with a basic bar graph of passenger class. The most direct way to create a bar chart with unsummarized, “raw” data is to add geom_bar() to the blank plot created by ggplot().
To create a basic bar chart with geom_bar(), you need to use either the x or y aesthetic (depending on whether you want the group names on the x- or y-axis) and not assign the other aesthetic to a column.
If you want to change the color of the bars created by geom_bar() (or any other bars created by a geom), you should use fill, not color. The color aesthetic only changes the outline of the bars.
First create a bar chart using geom_bar(). Once you’ve got the code to run properly, SET the fill aesthetic to “steelblue” and color to “black”.
Then, once you have that working, change the label on the x-axis to say “Passenger Class” and add a title of “Titanic Passengers”.
Remember, setting an aesthetic means you do not include it inside aes().
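One way these steps might look is sketched below. It does not read the course CSV: the stand-in data frame t_class (a hypothetical name) is rebuilt from the class counts in the skim output above, so it has one row per passenger but only the class column.

```r
library(ggplot2)

# Stand-in for t_df (assumption): one row per passenger, built from the class counts
class_order <- c("First", "Second", "Third", "Crew")
t_class <- data.frame(
  class = factor(rep(class_order, times = c(325, 285, 705, 885)),
                 levels = class_order)
)

gg_basic <-
  ggplot(data = t_class,
         mapping = aes(x = class)) +   # x is MAPPED to a column, so it goes inside aes()
  geom_bar(fill = "steelblue",         # fill and color are SET, so they go outside aes()
           color = "black") +
  labs(x = "Passenger Class",
       title = "Titanic Passengers")

gg_basic
```

Because only x is mapped, geom_bar() counts the rows in each class and uses the counts for the bar heights.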
While it is not generally recommended, you can change the color of each bar by mapping fill to class as well. If you do, you should remove the legend that ggplot() automatically creates. This can be done by including show.legend = F in geom_bar().
You can also manually pick the color of the bars by giving the fill argument a vector of colors whose length equals the number of bars. Redo the graph above, but set fill = c("tomato", "steelblue", "seagreen", "orchid").
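The two coloring approaches can be compared side by side in a short sketch. As an assumption, the stand-in data frame t_class (a hypothetical name) is rebuilt from the class counts in the skim output rather than read from the course CSV.

```r
library(ggplot2)

# Stand-in for t_df (assumption): one row per passenger, built from the class counts
class_order <- c("First", "Second", "Third", "Crew")
t_class <- data.frame(
  class = factor(rep(class_order, times = c(325, 285, 705, 885)),
                 levels = class_order)
)

# Option 1: MAP fill to class inside aes() and drop the redundant legend
gg_mapped <-
  ggplot(data = t_class, mapping = aes(x = class, fill = class)) +
  geom_bar(color = "black", show.legend = F)

# Option 2: SET fill to a vector with one color per bar (outside aes())
gg_manual <-
  ggplot(data = t_class, mapping = aes(x = class)) +
  geom_bar(fill = c("tomato", "steelblue", "seagreen", "orchid"),
           color = "black")

gg_mapped
gg_manual
```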
Once you’ve got it to match what is in Brightspace, move on to the next step: replacing the y-axis counts with proportions!
When only working with categorical data, it’s common to work with a table of counts instead of the “raw”, unsummarized data. For instance, we can count the number of passengers in each class using the count() function:
## class n
## 1 First 325
## 2 Second 285
## 3 Third 705
## 4 Crew 885
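A table like the one above could be produced along these lines. This is a sketch, assuming dplyr is installed and using a stand-in data frame t_class (a hypothetical name) rebuilt from the class counts instead of the course CSV.

```r
library(dplyr)

# Stand-in for t_df (assumption): one row per passenger, built from the class counts
class_order <- c("First", "Second", "Third", "Crew")
t_class <- data.frame(
  class = factor(rep(class_order, times = c(325, 285, 705, 885)),
                 levels = class_order)
)

# count() returns one row per group, with the number of rows stored in a column named n
passenger_count <- count(t_class, class)
passenger_count
```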
What happens if we try to create a bar chart with the summarized data set passenger_count?
Take the code from the “setting fill with a vector” code chunk, copy and paste it into the code chunk below, then make the following changes:
- Set data = passenger_count
- Inside mapping = aes(), add y = n
Once you’ve made those changes, uncomment the scale_ function at the bottom of the code chunk.
You should see an error. The line next to the second ! in the error message says that stat_count() must only have an x or y aesthetic. The message means that if you are using geom_bar(), you can’t specify both x and y, because geom_bar() will calculate the counts for the unmapped aesthetic. So what do we do?
If you’re working with summarized data and you need to specify both x and y (like our passenger_count data), we replace geom_bar() with geom_col(). Copy and paste the code above into the code chunk below, then make the appropriate change!
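Both the error and the fix can be reproduced in a small sketch. As an assumption, passenger_count is typed in by hand from the counts shown earlier, and tryCatch() is used only so the failing plot does not stop the script.

```r
library(ggplot2)

# Summarized data, entered from the counts shown earlier
class_order <- c("First", "Second", "Third", "Crew")
passenger_count <- data.frame(
  class = factor(class_order, levels = class_order),
  n     = c(325, 285, 705, 885)
)

gg_counts <- ggplot(data = passenger_count,
                    mapping = aes(x = class, y = n))

# geom_bar() wants to count rows itself, so mapping both x and y fails when the plot is built
bar_attempt <- tryCatch(ggplot_build(gg_counts + geom_bar()),
                        error = function(e) e)
inherits(bar_attempt, "error")

# geom_col() uses the y values as the bar heights, so the same mapping works
gg_col <- gg_counts +
  geom_col(fill = "steelblue", color = "black") +
  labs(x = "Passenger Class", title = "Titanic Passengers")

gg_col
```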
Now we will be looking at the association between passenger class and passenger status.
We want to know, “were passengers more or less likely to survive the sinking of the Titanic depending on their passenger class?”
Anytime you are working with two variables, it is important to determine the role the variables play in the data - explanatory or response.
If we want to know if a passenger was more or less likely to survive based on the type of passenger, which of the two variables is the explanatory and which is the response?
Explanatory: __________________
Response: ____________________
When creating a bar chart for two variables, you want to use the x (or y) and fill aesthetics, and the role each variable plays determines which aesthetic it will be assigned to:
- x = {explanatory}
- fill = {response}
These can be mapped in either ggplot() or geom_bar().
Start by creating a blank graph by mapping the variables in ggplot() and using labs() to change the x- and y-axis labels and add a title to match what is in Brightspace. Save the result as gg_titanic.
Once you’ve created your blank graph, add geom_bar() to it (but don’t save the result, at least not yet…)
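A sketch of the two steps follows. Two things are assumptions here: the data come from R's built-in Titanic table (2201 people, so slightly different from the 2200 rows in the course file), expanded to one row per person in a stand-in data frame t_demo (a hypothetical name), and the axis labels and title are placeholders, since the Brightspace version is not shown here.

```r
library(ggplot2)

# Stand-in for t_df (assumption): expand R's built-in Titanic table to one row per person
tt  <- as.data.frame(Titanic)
raw <- tt[rep(seq_len(nrow(tt)), times = tt$Freq), ]
t_demo <- data.frame(
  class  = factor(raw$Class, levels = c("1st", "2nd", "3rd", "Crew"),
                  labels = c("First", "Second", "Third", "Crew")),
  status = factor(raw$Survived, levels = c("Yes", "No"),
                  labels = c("Alive", "Dead"))
)

# Blank graph: explanatory variable on x, response variable as fill
gg_titanic <-
  ggplot(data = t_demo,
         mapping = aes(x = class, fill = status)) +
  labs(x = "Passenger Class",          # placeholder labels (assumption)
       y = "Passengers",
       title = "Titanic Passengers by Class and Status")

# Adding geom_bar() without saving the result draws the default stacked bars
gg_titanic + geom_bar()
```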
geom_bar() defaults to placing the bars for the fill variable stacked on top of one another. If we keep the default choice, this is called a segmented bar chart (which is better when comparing groups of one variable across another variable).
But how can we change it to be a side-by-side bar chart (what is typically seen)?
Each geom has a position = argument, which specifies how to position objects that have the same x or y value.
For instance, the bars for Alive/First and Dead/First have the same x-value: First. ggplot() needs a way to position the bars in the graph, and geom_bar() defaults to “stack”.
If we want the bars to be placed next to each other, we need to specify position = "dodge" (so they are near each other but don’t overlap).
The common choices for position are "stack", "dodge" (or "dodge2"), "fill", "identity", and "jitter" (mainly used with geom_point()).
If you want to place the bars next to each other, try position = "dodge" or position = "dodge2".
While it is never recommended for bar charts, try using position = "identity" and see how it looks! It will help to include alpha = 0.5 in geom_bar() to figure out what is happening!
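The position choices can be tried out with a sketch like the one below. As an assumption, the data come from R's built-in Titanic table (2201 people) expanded to one row per person in a stand-in data frame t_demo (a hypothetical name), not from the course CSV.

```r
library(ggplot2)

# Stand-in for t_df (assumption): expand R's built-in Titanic table to one row per person
tt  <- as.data.frame(Titanic)
raw <- tt[rep(seq_len(nrow(tt)), times = tt$Freq), ]
t_demo <- data.frame(
  class  = factor(raw$Class, levels = c("1st", "2nd", "3rd", "Crew"),
                  labels = c("First", "Second", "Third", "Crew")),
  status = factor(raw$Survived, levels = c("Yes", "No"),
                  labels = c("Alive", "Dead"))
)

gg_base <- ggplot(data = t_demo, mapping = aes(x = class, fill = status))

# Side-by-side bars: each class gets one narrower bar per status
gg_dodge <- gg_base + geom_bar(position = "dodge")

# "identity": both bars start at 0 and are drawn on top of each other,
# so alpha = 0.5 is needed to see the one hidden behind
gg_identity <- gg_base + geom_bar(position = "identity", alpha = 0.5)

gg_dodge
gg_identity
```

With "identity", the shorter bar in each class sits entirely inside the taller one, which is why this position is not recommended for bar charts.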
geom_bar() defaults to summarizing the data with counts, which can make it difficult to compare whether the survival and death rates are the same across the 4 different passenger classes, since Third and Crew are so much larger than First and Second.
Instead, it’s best to show the survival rates by changing the y-axis to the conditional proportion: the percentage of survivors within each passenger class (and likewise the percentage that died within each passenger class).
What do we do if we want to represent the proportion of passengers that lived or died within each passenger class?
If you want to display the conditional proportions within each group, you just need to include position = "fill" in geom_bar().
Once you’ve created the bar chart, save the result as gg_class_status.
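A possible version of this step is sketched below. As an assumption, the data come from R's built-in Titanic table (2201 people) expanded to one row per person in a stand-in data frame t_demo (a hypothetical name), and the y-axis label is a placeholder.

```r
library(ggplot2)

# Stand-in for t_df (assumption): expand R's built-in Titanic table to one row per person
tt  <- as.data.frame(Titanic)
raw <- tt[rep(seq_len(nrow(tt)), times = tt$Freq), ]
t_demo <- data.frame(
  class  = factor(raw$Class, levels = c("1st", "2nd", "3rd", "Crew"),
                  labels = c("First", "Second", "Third", "Crew")),
  status = factor(raw$Survived, levels = c("Yes", "No"),
                  labels = c("Alive", "Dead"))
)

# position = "fill" rescales every stacked bar so that it runs from 0 to 1
gg_class_status <-
  ggplot(data = t_demo,
         mapping = aes(x = class, fill = status)) +
  geom_bar(position = "fill") +
  labs(x = "Passenger Class",
       y = "Proportion")   # placeholder label (assumption)

gg_class_status
```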
Now we get a much clearer picture. If passenger class and status were independent, the dividing line within each bar would be at about the same height. Since the red area for First class is much larger than for the other 3 passenger classes, the graph shows that First class passengers had a much higher chance of surviving!
The same holds true comparing Second class with Third and Crew, just not to the same extent as first class passengers.
Like we did with the summarized data set for passenger class earlier, we can create the same bar charts as seen above using geom_col() instead of geom_bar(). Under the code creating the passenger_status data set below, create a stacked bar chart using the passenger_status data set:
Now, using what you’ve learned earlier, create the stacked bar chart with conditional percentages on the y-axis instead of the counts!
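Both versions might look like the sketch below. Two assumptions: the data come from R's built-in Titanic table (2201 people) via a stand-in data frame t_demo (a hypothetical name), and passenger_status is built with count(), which may differ from how the handout's own chunk creates it.

```r
library(ggplot2)
library(dplyr)

# Stand-in for t_df (assumption): expand R's built-in Titanic table to one row per person
tt  <- as.data.frame(Titanic)
raw <- tt[rep(seq_len(nrow(tt)), times = tt$Freq), ]
t_demo <- data.frame(
  class  = factor(raw$Class, levels = c("1st", "2nd", "3rd", "Crew"),
                  labels = c("First", "Second", "Third", "Crew")),
  status = factor(raw$Survived, levels = c("Yes", "No"),
                  labels = c("Alive", "Dead"))
)

# Summarized data: one row per class/status combination, counts in n
passenger_status <- count(t_demo, class, status)

gg_status <- ggplot(data = passenger_status,
                    mapping = aes(x = class, y = n, fill = status))

# Stacked bar chart of counts (geom_col() also stacks by default)
gg_status_stack <- gg_status + geom_col()

# Stacked bar chart of conditional proportions
gg_status_fill <- gg_status + geom_col(position = "fill")

gg_status_stack
gg_status_fill
```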
geom_bar() will default to stacked bar charts, regardless of whether it is calculating counts or proportions. How do we create a side-by-side bar chart with conditional proportions using geom_bar()?
Why can’t we create a side-by-side bar chart using geom_bar() displaying the conditional proportions?
To create a side-by-side bar chart, you specify position = "dodge".
To display the conditional proportions, you specify position = "fill".
And each function in R can only use the same argument once. We can’t have two different position = arguments in the same geom!
So we can’t create this graph using geom_bar() and the “raw” data, but it is still possible to create it with ggplot()!
We’ll see later how we can do it ourselves, but it involves calculating the conditional proportions using R, and we haven’t gotten to that point yet!
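As a preview only, one possible approach is sketched below; it is not necessarily the method the course will use. It assumes dplyr for the calculation and uses R's built-in Titanic table (2201 people) as a stand-in for the course data. The names t_demo, passenger_prop, and prop are hypothetical.

```r
library(ggplot2)
library(dplyr)

# Stand-in for t_df (assumption): expand R's built-in Titanic table to one row per person
tt  <- as.data.frame(Titanic)
raw <- tt[rep(seq_len(nrow(tt)), times = tt$Freq), ]
t_demo <- data.frame(
  class  = factor(raw$Class, levels = c("1st", "2nd", "3rd", "Crew"),
                  labels = c("First", "Second", "Third", "Crew")),
  status = factor(raw$Survived, levels = c("Yes", "No"),
                  labels = c("Alive", "Dead"))
)

# Calculate the proportions within each class ourselves
passenger_prop <- t_demo %>%
  count(class, status) %>%
  group_by(class) %>%
  mutate(prop = n / sum(n)) %>%   # proportions sum to 1 within each class
  ungroup()

# The heights are already proportions, so position is free to be "dodge"
gg_prop_dodge <-
  ggplot(data = passenger_prop,
         mapping = aes(x = class, y = prop, fill = status)) +
  geom_col(position = "dodge")

gg_prop_dodge
```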
Let’s improve the look of our conditional proportion bar chart by doing the following:
- Remove the label for fill, since Alive and Dead don’t need added context
- Change Alive to steelblue and Dead to tomato (blue generally indicates good and orange/red indicates bad)
- Change the y-axis to be percentages
- Remove the added space at the bottom of the graph
gg_class_status +
  # Removing the fill label with labs() and NULL
  labs(fill = NULL) +
  # Changing the colors used for fill with scale_fill_manual() and values
  scale_fill_manual(values = c("Alive" = "steelblue",
                               "Dead" = "tomato")) +
  # Changing the labels on the y-axis using scale_y_continuous() and labels
  scale_y_continuous(labels = scales::percent,
                     # expand controls how much space is added at the bottom and top of the graph
                     expand = c(0, 0,      # The first 2 numbers are the multiplier and the amount added at the bottom of the graph
                                0, 0.05))  # The second 2 numbers are the multiplier and the amount added at the top of the graph
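The four-number expand vector can also be written with ggplot2's expansion() helper, which names the multiplier and the added amount explicitly. The sketch below assumes the same stand-in data as a rebuilt gg_class_status: R's built-in Titanic table (2201 people), not the course CSV, in a data frame with the hypothetical name t_demo.

```r
library(ggplot2)

# Stand-in for t_df (assumption): expand R's built-in Titanic table to one row per person
tt  <- as.data.frame(Titanic)
raw <- tt[rep(seq_len(nrow(tt)), times = tt$Freq), ]
t_demo <- data.frame(
  class  = factor(raw$Class, levels = c("1st", "2nd", "3rd", "Crew"),
                  labels = c("First", "Second", "Third", "Crew")),
  status = factor(raw$Survived, levels = c("Yes", "No"),
                  labels = c("Alive", "Dead"))
)

gg_class_status <-
  ggplot(data = t_demo, mapping = aes(x = class, fill = status)) +
  geom_bar(position = "fill")

# expansion() returns the vector in the order: bottom multiplier, bottom addition,
# top multiplier, top addition
y_expand <- expansion(mult = 0, add = c(0, 0.05))
y_expand

gg_polished <-
  gg_class_status +
  labs(fill = NULL) +
  scale_fill_manual(values = c("Alive" = "steelblue",
                               "Dead" = "tomato")) +
  scale_y_continuous(labels = scales::percent,
                     expand = y_expand)

gg_polished
```

Here add = c(0, 0.05) leaves no space below the bars and adds 0.05 (five percentage points on this 0 to 1 scale) above them.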