knitr::opts_chunk$set(echo = FALSE)
# Start by loading the tidyverse, gt, and skimr packages
pacman::p_load(tidyverse, skimr, gt)
# Next, read in the Titanic data set from GitHub
titanic <- read.csv("https://raw.githubusercontent.com/Shammalamala/DS-1870-Data/main/titanic.csv")
Let’s check the data by using `head()` and `skim()`:
## class age sex survival
## 1 First Adult Male Alive
## 2 First Adult Male Alive
## 3 First Adult Male Alive
## 4 First Adult Male Alive
## 5 First Adult Male Alive
## 6 First Adult Male Alive
Name | titanic |
Number of rows | 2201 |
Number of columns | 4 |
_______________________ | |
Column type frequency: | |
character | 4 |
________________________ | |
Group variables | None |
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
class | 0 | 1 | 4 | 6 | 0 | 4 | 0 |
age | 0 | 1 | 5 | 5 | 0 | 2 | 0 |
sex | 0 | 1 | 4 | 6 | 0 | 2 | 0 |
survival | 0 | 1 | 4 | 5 | 0 | 2 | 0 |
Looks like we have 4 categorical columns and no missing values!
There are two types of tables we can create to summarize categorical data:
Let’s start by creating the simpler of the two, the frequency table.
We’ll use the `count()` function in the dplyr package (part of the tidyverse) to create a frequency table. It requires two pieces:
1. The data frame, piped into `count()`: `data |> count()`
2. The column to count, named inside `count()`: `count(col_name)`
Start by creating a frequency table for `class`:
## class n
## 1 Crew 885
## 2 First 325
## 3 Second 285
## 4 Third 706
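A table like the one above can be reproduced with a single `count()` call. The sketch below is self-contained, so instead of reading the CSV it rebuilds a stand-in `titanic` data frame from the counts shown above; that stand-in and the name `class_counts` are assumptions made for illustration.

```r
library(dplyr)

# Stand-in for the class column, rebuilt from the counts printed above
# (an assumption so the example runs without downloading the CSV).
titanic <- data.frame(
  class = rep(c("Crew", "First", "Second", "Third"),
              times = c(885, 325, 285, 706))
)

# Pipe the data frame into count() and name the column to tally.
class_counts <- titanic |>
  count(class)

class_counts
```

Because `class` is stored as text, `count()` lists the groups alphabetically, which is why Crew comes first.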
Something isn’t quite right with the table. The groups aren’t in the correct order! How can we arrange the groups as: First, Second, Third, and Crew?
By converting the column to a factor and specifying the order of the groups using the `factor()` function and its `levels` argument:
When rearranging the group order, it is very, very important that you type the group names exactly as they appear in the column, including capitalization!
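As a rough sketch of this step, the code below converts `class` to a factor with the desired order, again using a stand-in `titanic` data frame built from the counts shown earlier (an assumption, not the full data set). It also shows what happens when a level is misspelled; the vector `bad` is a hypothetical example.

```r
# Stand-in for the class column (assumption: rebuilt from the printed counts).
titanic <- data.frame(
  class = rep(c("Crew", "First", "Second", "Third"),
              times = c(885, 325, 285, 706))
)

# Convert class to a factor and spell out the order of the groups.
titanic$class <- factor(
  titanic$class,
  levels = c("First", "Second", "Third", "Crew")
)

levels(titanic$class)

# Hypothetical mistake: "first" does not match "First" in the data,
# so every "First" value silently becomes NA.
bad <- factor(c("First", "Crew"), levels = c("first", "Crew"))
is.na(bad)
```

R does not warn about values that are missing from `levels`, so a quick check such as `sum(is.na(titanic$class))` after the conversion can catch typos.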
Now we can create the table with the groups in the correct order!
This time, save the table as `class_freq`, but change the name of the `n` column (the default name) to `freq` by including the additional `name = "freq"` argument:
## class freq
## 1 First 325
## 2 Second 285
## 3 Third 706
## 4 Crew 885
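One way the table above could be built is sketched here. The stand-in `titanic` data frame is an assumption (rebuilt from the printed counts so the example runs on its own); `count()` and its `name` argument come from dplyr.

```r
library(dplyr)

# Stand-in for the class column (assumption: rebuilt from the printed counts).
titanic <- data.frame(
  class = rep(c("Crew", "First", "Second", "Third"),
              times = c(885, 325, 285, 706))
)

# Put the groups in the intended order before counting.
titanic$class <- factor(
  titanic$class,
  levels = c("First", "Second", "Third", "Crew")
)

# name = "freq" replaces the default column name n.
class_freq <- titanic |>
  count(class, name = "freq")

class_freq
```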
We can improve how the table looks using the `gt()` function from the gt package! Make sure not to save the result, because `gt()` returns a formatted table rather than a data frame, so you won’t be able to make further changes with functions like `count()` or `mutate()`!
class | freq |
---|---|
First | 325 |
Second | 285 |
Third | 706 |
Crew | 885 |
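A minimal sketch of the formatting step, assuming the same stand-in data as before (rebuilt from the printed counts rather than read from the CSV). The result of `gt()` is printed directly and not assigned, so `class_freq` stays an ordinary data frame.

```r
library(dplyr)
library(gt)

# Stand-in for the class column (assumption: rebuilt from the printed counts).
titanic <- data.frame(
  class = factor(
    rep(c("First", "Second", "Third", "Crew"),
        times = c(325, 285, 706, 885)),
    levels = c("First", "Second", "Third", "Crew")
  )
)

class_freq <- titanic |>
  count(class, name = "freq")

# Render the table without saving the result over class_freq.
class_freq |>
  gt()
```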
Let’s add a column to our frequency table for the proportions/relative frequencies!
To do this, we can add another step in our pipe chain by using `mutate()`. `mutate()` is a function we’ll look at in more detail later, but in short, it adds new columns to our data frame with `mutate(new_col_name = ...)`, where `new_col_name` is the name of the new column you want to make (like `prop`).
Try it out in the code chunk below: start with the `titanic` data frame, then chain the `count()` and `mutate()` functions!
## class freq prop
## 1 First 325 0.1476602
## 2 Second 285 0.1294866
## 3 Third 706 0.3207633
## 4 Crew 885 0.4020900
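Each proportion above is a group's count divided by the total count (for example, 325 / 2201 for First). One possible solution is sketched below; the stand-in `titanic` data frame and the name `class_prop` are assumptions for illustration.

```r
library(dplyr)

# Stand-in for the class column (assumption: rebuilt from the printed counts).
titanic <- data.frame(
  class = factor(
    rep(c("First", "Second", "Third", "Crew"),
        times = c(325, 285, 706, 885)),
    levels = c("First", "Second", "Third", "Crew")
  )
)

class_prop <- titanic |>
  count(class, name = "freq") |>
  # Relative frequency: each count divided by the total of all counts.
  mutate(prop = freq / sum(freq))

class_prop
```

Since every row is counted exactly once, the `prop` column should sum to 1, which makes a handy sanity check.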
We can also round the proportions by using the `round()` function, like we saw earlier, inside the `mutate()` function. Copy your code from the previous code chunk, change it so the `prop` column is rounded to 3 decimal places, and save the result as `class_rel_freq`:
## class freq prop
## 1 First 325 0.148
## 2 Second 285 0.129
## 3 Third 706 0.321
## 4 Crew 885 0.402
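A sketch of the rounded version, under the same assumption of a stand-in `titanic` data frame rebuilt from the printed counts. The only change from the unrounded calculation is wrapping it in `round(..., 3)`.

```r
library(dplyr)

# Stand-in for the class column (assumption: rebuilt from the printed counts).
titanic <- data.frame(
  class = factor(
    rep(c("First", "Second", "Third", "Crew"),
        times = c(325, 285, 706, 885)),
    levels = c("First", "Second", "Third", "Crew")
  )
)

class_rel_freq <- titanic |>
  count(class, name = "freq") |>
  # Round the relative frequency to 3 decimal places.
  mutate(prop = round(freq / sum(freq), 3))

class_rel_freq
```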
Finally, we’ll add a column named `percent` to `class_rel_freq`, as seen below:
## class freq prop percent
## 1 First 325 0.148 14.8
## 2 Second 285 0.129 12.9
## 3 Third 706 0.321 32.1
## 4 Crew 885 0.402 40.2
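The `percent` column above is the rounded proportion multiplied by 100. A possible way to add it is sketched here; to keep the example self-contained, `class_rel_freq` is typed in from the table shown earlier (an assumption), and `with_percent` is a hypothetical name.

```r
library(dplyr)

# class_rel_freq entered by hand from the rounded table above
# (an assumption so the example does not depend on earlier chunks).
class_rel_freq <- data.frame(
  class = factor(
    c("First", "Second", "Third", "Crew"),
    levels = c("First", "Second", "Third", "Crew")
  ),
  freq = c(325L, 285L, 706L, 885L),
  prop = c(0.148, 0.129, 0.321, 0.402)
)

with_percent <- class_rel_freq |>
  # Convert the proportion to a percentage.
  mutate(percent = prop * 100)

with_percent
```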
And if you wanna get really fancy, we can instead use the `paste0()` function to attach a percent sign, as seen below:
class_rel_freq <-
  class_rel_freq |>
  # Use mutate() again to add a percentage column
  mutate(
    percent = paste0(prop * 100, "%")
  )

class_rel_freq |>
  gt()
class | freq | prop | percent |
---|---|---|---|
First | 325 | 0.148 | 14.8% |
Second | 285 | 0.129 | 12.9% |
Third | 706 | 0.321 | 32.1% |
Crew | 885 | 0.402 | 40.2% |