Setting up the R Markdown File

knitr::opts_chunk$set(echo = FALSE)

# Start by loading the tidyverse, gt, and skimr packages
pacman::p_load(tidyverse, skimr, gt)

# Next, read in the Titanic data set from GitHub
titanic <- read.csv("https://raw.githubusercontent.com/Shammalamala/DS-1870-Data/main/titanic.csv")

Let’s check the data by using head() and skim().
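The knitted output shows only the results of this step, so here is a minimal sketch of what the chunk might look like. It re-reads the data from the same URL as the setup chunk so that it runs on its own; it is a reconstruction, not necessarily the original code.

```r
library(skimr)

# Re-read the data so this chunk runs on its own (same URL as the setup chunk)
titanic <- read.csv("https://raw.githubusercontent.com/Shammalamala/DS-1870-Data/main/titanic.csv")

# Look at the first 6 rows
head(titanic)

# Summarize every column: type, missing values, number of unique values
skim(titanic)
```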

##   class   age  sex survival
## 1 First Adult Male    Alive
## 2 First Adult Male    Alive
## 3 First Adult Male    Alive
## 4 First Adult Male    Alive
## 5 First Adult Male    Alive
## 6 First Adult Male    Alive

Data summary
Name titanic
Number of rows 2201
Number of columns 4
_______________________
Column type frequency:
character 4
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
class 0 1 4 6 0 4 0
age 0 1 5 5 0 2 0
sex 0 1 4 6 0 2 0
survival 0 1 4 5 0 2 0

Looks like we have 4 categorical columns and no missing values!

Section 1: Tables for a single categorical variable

There are two types of tables we can create to summarize categorical data:

  1. Frequency table: Counts how many cases occur in each of the variable’s groups
  2. Relative frequency table: Converts the counts to proportions

Let’s start with the simpler of the two, the frequency table.

1.1: Frequency tables

We’ll use the count() function in the dplyr package (in tidyverse) to create a frequency table. It requires two pieces:

  1. The name of the data set that we will pipe into count(): data |> count()
  2. The column we want it to count: count(col_name)

Start by creating a frequency table for class:
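One way this step could be written is sketched below. To keep the chunk runnable without a download, it rebuilds a stand-in class column from the counts printed in this section; with the real data you would pipe the titanic data frame from the setup chunk instead.

```r
library(dplyr)

# Stand-in for titanic$class, rebuilt from the counts printed in this section
# (an assumption so the chunk runs alone; the real data has 4 columns)
titanic <- data.frame(
  class = rep(c("First", "Second", "Third", "Crew"), times = c(325, 285, 706, 885))
)

# Pipe the data into count() and name the column to count
titanic |>
  count(class)
```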

##    class   n
## 1   Crew 885
## 2  First 325
## 3 Second 285
## 4  Third 706

Something isn’t quite right with the table: the groups aren’t in the correct order! Because class is a character column, count() lists its groups alphabetically, so Crew comes first. How can we arrange the groups as First, Second, Third, and Crew?

We can convert the column to a factor and specify the order of the groups using the factor() function and its levels argument.
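The conversion could look something like the sketch below. It uses a plain base R assignment, which is an assumption; the original chunk is not shown and may do the same thing inside mutate(). The stand-in data frame is rebuilt from the counts in this section so the chunk runs on its own.

```r
# Stand-in for titanic$class, rebuilt from the counts printed in this section
# (an assumption so the chunk runs alone)
titanic <- data.frame(
  class = rep(c("First", "Second", "Third", "Crew"), times = c(325, 285, 706, 885))
)

# Convert class to a factor and spell out the order of its groups
titanic$class <- factor(
  titanic$class,
  levels = c("First", "Second", "Third", "Crew")
)

# Check the new group order
levels(titanic$class)
```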

When rearranging the group order, it is very, very important that you type the group names exactly as they appear in the column, including capitalization! Any value that doesn’t match one of the levels you typed is turned into NA.

Now we can create the table with the groups in the correct order! This time, save the table as class_freq, and rename the count column from its default name, n, to freq by adding the name = "freq" argument to count():
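A sketch of this step, assuming class has already been converted to a factor as described above. The stand-in data and the helper name groups are hypothetical, used only so the chunk runs on its own.

```r
library(dplyr)

# Stand-in for titanic$class with the groups already ordered as a factor
# (groups is a hypothetical helper; counts come from the tables in this section)
groups <- c("First", "Second", "Third", "Crew")
titanic <- data.frame(
  class = factor(rep(groups, times = c(325, 285, 706, 885)), levels = groups)
)

# Count each class and name the count column freq instead of the default n
class_freq <-
  titanic |>
  count(class, name = "freq")

class_freq
```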

##    class freq
## 1  First  325
## 2 Second  285
## 3  Third  706
## 4   Crew  885

We can improve how the table looks using the gt() function from the gt package! Make sure not to save the result: gt() returns a display table rather than a data frame, so once you use it, you won’t be able to make further changes with functions like count() or mutate().
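As a minimal sketch, the table can be piped straight into gt() without assigning the result, which leaves class_freq itself as an ordinary data frame. The class_freq built here is a stand-in that copies the values printed above.

```r
library(gt)

# Stand-in for class_freq, copied from the frequency table printed above
# (groups is a hypothetical helper name)
groups <- c("First", "Second", "Third", "Crew")
class_freq <- data.frame(
  class = factor(groups, levels = groups),
  freq = c(325L, 285L, 706L, 885L)
)

# Display the table with gt(); nothing is assigned, so class_freq is unchanged
class_freq |>
  gt()
```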

class freq
First 325
Second 285
Third 706
Crew 885

1.2: Relative frequency tables

Let’s add a column to our frequency table for the proportions/relative frequencies!

To do this, we can add another step to our pipe chain with mutate(). We’ll look at mutate() in more detail later, but in short, it adds new columns to a data frame: mutate(new_col_name = ...), where new_col_name is the name of the column you want to create (like prop) and ... is the calculation that fills it. Here, each proportion is the group’s freq divided by the total of the freq column.

Try it out in the code chunk below: start with the titanic data frame, then pipe it through count() and mutate()!
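One possible solution is sketched here, assuming each proportion is computed as the group count divided by the total count. The stand-in data and the helper name groups are hypothetical, included so the chunk runs on its own.

```r
library(dplyr)

# Stand-in for titanic$class with the groups already ordered as a factor
# (groups is a hypothetical helper; counts come from the tables in this section)
groups <- c("First", "Second", "Third", "Crew")
titanic <- data.frame(
  class = factor(rep(groups, times = c(325, 285, 706, 885)), levels = groups)
)

titanic |>
  # Frequency table with the count column named freq
  count(class, name = "freq") |>
  # Proportion = each group's count divided by the total count
  mutate(prop = freq / sum(freq))
```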

##    class freq      prop
## 1  First  325 0.1476602
## 2 Second  285 0.1294866
## 3  Third  706 0.3207633
## 4   Crew  885 0.4020900

We can also round the proportions by wrapping the calculation in the round() function we saw earlier, inside mutate(). Copy your code from the previous code chunk, change it so the prop column is rounded to 3 decimal places, and save the result as class_rel_freq.
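A sketch of the rounded version, under the same assumptions as before (stand-in data rebuilt from the printed counts, groups as a hypothetical helper name):

```r
library(dplyr)

# Stand-in for titanic$class with the groups already ordered as a factor
groups <- c("First", "Second", "Third", "Crew")
titanic <- data.frame(
  class = factor(rep(groups, times = c(325, 285, 706, 885)), levels = groups)
)

class_rel_freq <-
  titanic |>
  count(class, name = "freq") |>
  # Round each proportion to 3 decimal places
  mutate(prop = round(freq / sum(freq), 3))

class_rel_freq
```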

##    class freq  prop
## 1  First  325 0.148
## 2 Second  285 0.129
## 3  Third  706 0.321
## 4   Crew  885 0.402

Finally, we’ll add a column named percent to class_rel_freq by multiplying prop by 100, as seen below:
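The percent column could be added along these lines. The class_rel_freq built here is a stand-in that copies the rounded table printed above, so the chunk runs without the earlier steps; the result is printed rather than saved.

```r
library(dplyr)

# Stand-in for class_rel_freq, copied from the rounded table printed above
# (groups is a hypothetical helper name)
groups <- c("First", "Second", "Third", "Crew")
class_rel_freq <- data.frame(
  class = factor(groups, levels = groups),
  freq = c(325L, 285L, 706L, 885L),
  prop = c(0.148, 0.129, 0.321, 0.402)
)

class_rel_freq |>
  # Convert each proportion to a percentage
  mutate(percent = prop * 100)
```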

##    class freq  prop percent
## 1  First  325 0.148    14.8
## 2 Second  285 0.129    12.9
## 3  Third  706 0.321    32.1
## 4   Crew  885 0.402    40.2

And if you want to get really fancy, we can instead use the paste0() function to attach a % sign to each value (which stores percent as text rather than as a number), as seen below:

class_rel_freq <- 
  class_rel_freq |> 
  # Using mutate() again to add a percentage column
  mutate(
    percent = paste0(prop*100, "%")
  )

class_rel_freq |> 
  gt()

class freq prop percent
First 325 0.148 14.8%
Second 285 0.129 12.9%
Third 706 0.321 32.1%
Crew 885 0.402 40.2%