Quarto Refresher and Publishing

Titanic

Author

Robin Chavez

INTRODUCTION

The content above in lines 1 through 17 is called the YAML header. It sets the document title and specifies the output format used when you render the file. Rendering options include HTML, Word, and PDF. Here, we will render the document into an HTML file with folded code that wraps. Note, too, that we include a table of contents (toc), and we can apply global options to suppress warnings and messages as well.

The text in the white sections is where you can insert narrative and other useful information. The backslash at the end of a line is a line break, which means start a new line; otherwise, the text will run together.

CODE CHUNKS AND INLINE CODE REFRESHER

The gray section below is called a code chunk. This is where you will place R script.

All code chunks start and end with three backticks. Use the key just below the ESC key on a Windows keyboard.

The first line needs the {r} to indicate that this code chunk is using R script. (Quarto will run with other code types such as Python and you can include different code chunks in the same document. However, for this course, we will only use R throughout.)

You can run all of the commands just in this code chunk by clicking on the green play arrow in the upper right corner of the chunk. Go ahead and run the code below. Note, you may need to install the titanic package first using the Packages tab in the lower right quadrant of RStudio.

DATA UNDERSTANDING

For this exercise, we will use the dataset that is stored in the titanic package. Note that the data are already split into training and test datasets. We want to use all records, so the first thing we will do is concatenate the datasets (bind_rows) into one.

Code
# combine test and train datasets
library(titanic)
library(dplyr)
titanic <- bind_rows(titanic_train, titanic_test) 
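
Because the two datasets do not have exactly the same columns, it helps to see how bind_rows handles the mismatch. The sketch below uses two tiny made-up data frames (toy_train and toy_test are illustrative names, not objects from the titanic package) and the .id argument to record where each row came from.

```r
library(dplyr)

# Hypothetical stand-ins for titanic_train and titanic_test
toy_train <- data.frame(PassengerId = 1:2, Survived = c(0L, 1L), Age = c(22, 38))
toy_test  <- data.frame(PassengerId = 3:4, Age = c(34.5, 47))  # no Survived column

# A named list plus .id adds a column recording each row's source
toy_all <- bind_rows(list(train = toy_train, test = toy_test), .id = "source")
toy_all

# Columns missing from one input are filled with NA
sum(is.na(toy_all$Survived))
```

The same thing happens in the real data: titanic_test has no Survived column, so every row that comes from it has NA for Survived after the datasets are combined.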

View the Data

Let’s understand the structure of the dataset as well as view a few of the first and last observations.

Code
# view data and metadata 
str(titanic)        # look at the structure of the dataset
'data.frame':   1309 obs. of  12 variables:
 $ PassengerId: int  1 2 3 4 5 6 7 8 9 10 ...
 $ Survived   : int  0 1 1 1 0 0 0 0 1 1 ...
 $ Pclass     : int  3 1 3 1 3 3 1 3 3 2 ...
 $ Name       : chr  "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
 $ Sex        : chr  "male" "female" "female" "female" ...
 $ Age        : num  22 38 26 35 35 NA 54 2 27 14 ...
 $ SibSp      : int  1 1 0 1 0 0 0 3 0 1 ...
 $ Parch      : int  0 0 0 0 0 0 0 1 2 0 ...
 $ Ticket     : chr  "A/5 21171" "PC 17599" "STON/O2. 3101282" "113803" ...
 $ Fare       : num  7.25 71.28 7.92 53.1 8.05 ...
 $ Cabin      : chr  "" "C85" "" "C123" ...
 $ Embarked   : chr  "S" "C" "S" "S" ...
Code
head(titanic,3)   # look at the first 3 rows of the dataset 
  PassengerId Survived Pclass
1           1        0      3
2           2        1      1
3           3        1      3
                                                 Name    Sex Age SibSp Parch
1                             Braund, Mr. Owen Harris   male  22     1     0
2 Cumings, Mrs. John Bradley (Florence Briggs Thayer) female  38     1     0
3                              Heikkinen, Miss. Laina female  26     0     0
            Ticket    Fare Cabin Embarked
1        A/5 21171  7.2500              S
2         PC 17599 71.2833   C85        C
3 STON/O2. 3101282  7.9250              S
Code
tail(titanic,3)  # look at the last 3 rows of the dataset
     PassengerId Survived Pclass                         Name  Sex  Age SibSp
1307        1307       NA      3 Saether, Mr. Simon Sivertsen male 38.5     0
1308        1308       NA      3          Ware, Mr. Frederick male   NA     0
1309        1309       NA      3     Peter, Master. Michael J male   NA     1
     Parch             Ticket    Fare Cabin Embarked
1307     0 SOTON/O.Q. 3101262  7.2500              S
1308     0             359309  8.0500              S
1309     1               2668 22.3583              C

Descriptive Statistics

Note that important variables such as \(Survived\) and \(Age\) have missing values.

Code
summary(titanic)  # look at the descriptive statistics for the numeric variables
  PassengerId      Survived          Pclass          Name          
 Min.   :   1   Min.   :0.0000   Min.   :1.000   Length:1309       
 1st Qu.: 328   1st Qu.:0.0000   1st Qu.:2.000   Class :character  
 Median : 655   Median :0.0000   Median :3.000   Mode  :character  
 Mean   : 655   Mean   :0.3838   Mean   :2.295                     
 3rd Qu.: 982   3rd Qu.:1.0000   3rd Qu.:3.000                     
 Max.   :1309   Max.   :1.0000   Max.   :3.000                     
                NA's   :418                                        
     Sex                 Age            SibSp            Parch      
 Length:1309        Min.   : 0.17   Min.   :0.0000   Min.   :0.000  
 Class :character   1st Qu.:21.00   1st Qu.:0.0000   1st Qu.:0.000  
 Mode  :character   Median :28.00   Median :0.0000   Median :0.000  
                    Mean   :29.88   Mean   :0.4989   Mean   :0.385  
                    3rd Qu.:39.00   3rd Qu.:1.0000   3rd Qu.:0.000  
                    Max.   :80.00   Max.   :8.0000   Max.   :9.000  
                    NA's   :263                                     
    Ticket               Fare            Cabin             Embarked        
 Length:1309        Min.   :  0.000   Length:1309        Length:1309       
 Class :character   1st Qu.:  7.896   Class :character   Class :character  
 Mode  :character   Median : 14.454   Mode  :character   Mode  :character  
                    Mean   : 33.295                                        
                    3rd Qu.: 31.275                                        
                    Max.   :512.329                                        
                    NA's   :1                                              

Passenger Lookup

John Jacob Astor, one of the wealthiest people in the world, was aboard the Titanic. Let’s see if he survived.

Code
astor <- titanic[grepl("Astor", titanic$Name), ]
print(astor %>% select(Name, Survived))
                                                  Name Survived
701  Astor, Mrs. John Jacob (Madeleine Talmadge Force)        1
1094                            Astor, Col. John Jacob       NA

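The NA does not mean that Col. Astor's fate is unknown. His record comes from the test dataset, which does not include the Survived variable, so bind_rows filled that value with NA. The combined dataset therefore cannot tell us whether he survived (historical accounts record that he died in the sinking). His wife, whose record is in the training dataset, did survive.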

Average Age

Code
# create a new variable that holds the average age; na.rm = TRUE tells R to skip missing values
xage <- round(mean(titanic$Age, na.rm = TRUE), digits = 2)
xage
[1] 29.88

Rather than hardcode the value for the age, we will use inline code to print the value in the narrative. This is useful if the variable you create is dynamic, such as when you frequently update a dataset with new data. To do this, we use inline code, like this:
- The average age of all passengers on board the Titanic was 29.88.
- You will see the answer when you render the code.
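
In the source .qmd file, the number in the sentence above is not typed by hand. A minimal sketch of the idea, using a small made-up vector (toy_age and toy_xage are illustrative names) rather than the full dataset:

```r
# Hypothetical ages, including one missing value
toy_age <- c(22, 38, NA, 35)

# Same pattern as xage above: mean of the non-missing values, rounded
toy_xage <- round(mean(toy_age, na.rm = TRUE), digits = 2)
toy_xage

# In the narrative text of the .qmd file, inline code is written between
# backticks and starts with the letter r, for example:
#   The average age was `r toy_xage`.
# When the document renders, the backticked code is replaced by its value.
```

If the data change, the sentence updates the next time the document is rendered.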


PRACTICE CODE CHUNKS AND INLINE CODE

Let’s run just a few more bits of analysis. Specifically, let’s see if we can answer the following questions:
1. How many passengers are male and how many female?
2. How many passengers survived and how many died?
3. How many females survived? How many died?
4. What percent of females survived?
5. What percent of survivors were male?
6. What percent of passengers were females who perished?
7. What is the distribution of passengers by fare class and embark location?

1. Number of Passengers by Gender

The following table shows the number of female and male passengers aboard the Titanic.

Code
table(titanic$Sex)

female   male 
   466    843 

However, it may be nicer to show the answer as inline text rather than as code output:

Code
females <- nrow(titanic[titanic$Sex == "female", ])
males   <- nrow(titanic[titanic$Sex == "male", ])
  • There were 466 females and 843 males aboard the Titanic.

2. Number of Survivors and Non-Survivors

The following table shows the number of known survivors and non-survivors from the Titanic.
Note that 418 passengers have no recorded outcome (Unsure). These are the records from the test dataset, which does not include the Survived variable.

Code
# convert the variable to a factor
titanic$Survived.f <- as.factor(ifelse(is.na(titanic$Survived), "Unsure",  # if the value is NA, then unsure
                                       ifelse(titanic$Survived == 0, "Did Not Survive", 
                                                 "Survived")))
table1 <- as.data.frame(table(titanic$Survived.f))
table1
             Var1 Freq
1 Did Not Survive  549
2        Survived  342
3          Unsure  418

3. Number of Female Survivors and Non-Survivors

The following table shows the number of survivors and non-survivors by gender.

Code
table(titanic$Sex,titanic$Survived.f)
        
         Did Not Survive Survived Unsure
  female              81      233    152
  male               468      109    266
Code
# use this to create the values for inline code
female_survived <- nrow(titanic[titanic$Sex == "female" & titanic$Survived.f == "Survived", ])
female_nonsurvived <- nrow(titanic[titanic$Sex == "female" & titanic$Survived.f == "Did Not Survive" , ])

Records show that among the females on board the Titanic, 233 survived and 81 did not survive.


4. Percentage of Female Survivors and Non-Survivors

While the prior table showed the number, this table depicts the percentage of survivors by gender.

Code
# create contingency table
library(summarytools)
ctable(titanic$Sex,titanic$Survived.f, prop="r")
Cross-Tabulation, Row Proportions  
Sex * Survived.f  
Data Frame: titanic  

-------- ------------ ----------------- ------------- ------------- ---------------
           Survived.f   Did Not Survive      Survived        Unsure           Total
     Sex                                                                           
  female                     81 (17.4%)   233 (50.0%)   152 (32.6%)    466 (100.0%)
    male                    468 (55.5%)   109 (12.9%)   266 (31.6%)    843 (100.0%)
   Total                    549 (41.9%)   342 (26.1%)   418 (31.9%)   1309 (100.0%)
-------- ------------ ----------------- ------------- ------------- ---------------
Code
# rather do this by hand so you can use inline code
total_females <- sum(titanic$Sex == "female") # count females

percentage_female_survived <- round((female_survived / total_females) * 100, digits=1)
percentage_female_nonsurvived <- round((female_nonsurvived / total_females) * 100, digits=1)

Among females on board the Titanic, 50% survived and 17.4% did not survive.
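
The percentages above divide by all 466 females, including the 152 with no recorded outcome. As a complementary sketch (not part of the original exercise), the counts from the cross-tabulation can be re-entered by hand to compute survival rates among only those passengers whose outcome is known. The names counts, known, and known_pct are illustrative.

```r
# Counts copied from the Sex by Survived.f table above
counts <- matrix(c(81, 233, 152,
                   468, 109, 266),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(Sex = c("female", "male"),
                                 Status = c("Did Not Survive", "Survived", "Unsure")))

# Drop the Unsure column, then compute row percentages
known <- counts[, c("Did Not Survive", "Survived")]
known_pct <- round(prop.table(known, margin = 1) * 100, 1)
known_pct
```

Among females with a recorded outcome, about 74% survived, compared with about 19% of males.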


5. Percentage of Survivors that are Male

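The following code computes the percentage of males who survived and the percentage who did not, using the total number of males as the denominator. Note that this is the share of males who survived, which is not the same as the share of survivors who were male.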

Code
#inline code
total_males <- sum(titanic$Sex == "male")

#count male survivors
male_survived <- nrow(titanic[titanic$Sex == "male" & titanic$Survived.f == "Survived", ])
male_nonsurvived <- nrow(titanic[titanic$Sex == "male" & titanic$Survived.f == "Did Not Survive" , ])

percentage_male_survived <- round((male_survived/total_males)*100, digits = 1)
percentage_male_nonsurvived <- round((male_nonsurvived / total_males) * 100, digits=1)

Among males on board the Titanic, 12.9% survived and 55.5% did not survive.
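
Question 5 asks for the share of survivors who were male, which uses the number of survivors as the denominator rather than the number of males. A small sketch using the counts reported earlier (typed in by hand here, so the variable names are illustrative):

```r
# Known survivors by gender, from the cross-tabulation above
survivors <- c(female = 233, male = 109)

# Share of all known survivors who were male
pct_survivors_male <- round(survivors[["male"]] / sum(survivors) * 100, digits = 1)
pct_survivors_male
```

By this calculation, about 31.9% of the known survivors were male.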


6. Percent of All Titanic Passengers Who Were Females Who Perished

Code
total_passengers <- nrow(titanic)
female_perished <- nrow(titanic[titanic$Sex == "female" & titanic$Survived.f == "Did Not Survive", ])
percentage_female_perished <- round((female_perished / total_passengers) * 100, digits = 1)

Among all passengers on the Titanic, 6.2% were females who perished.


7. Distribution of Passengers by Fare Class and Embark Location

Note that passengers embarked at one of three locations (in order).
S: Southampton, England
C: Cherbourg, France
Q: Queenstown, Ireland

Code
# fix missing labels and convert to a factor
titanic$Embarked <- ifelse(titanic$Embarked == "", "Unknown", titanic$Embarked)
titanic$Embarked.f <- as.factor(titanic$Embarked)

# create a contingency table
embarked_fare_table <- table(titanic$Embarked.f, titanic$Pclass)
# convert to a data frame for better presentation
embarked_fare_df <- as.data.frame.matrix(embarked_fare_table)
embarked_fare_df
          1   2   3
C       141  28 101
Q         3   7 113
S       177 242 495
Unknown   2   0   0
Code
# to print in line
embarked_fare_total_S <- nrow(titanic[titanic$Embarked.f == "S", ])
embarked_fare_total_c <- nrow(titanic[titanic$Embarked.f == "C", ])
embarked_fare_total_q <- nrow(titanic[titanic$Embarked.f == "Q", ])

embarked_fare_s1 <- nrow(titanic[titanic$Embarked.f == "S" & titanic$Pclass == 1, ])
embarked_fare_s2 <- nrow(titanic[titanic$Embarked.f == "S" & titanic$Pclass == 2, ])
embarked_fare_s3 <- nrow(titanic[titanic$Embarked.f == "S" & titanic$Pclass == 3, ])

embarked_fare_c1 <- nrow(titanic[titanic$Embarked.f == "C" & titanic$Pclass == 1, ])
embarked_fare_c2 <- nrow(titanic[titanic$Embarked.f == "C" & titanic$Pclass == 2, ])
embarked_fare_c3 <- nrow(titanic[titanic$Embarked.f == "C" & titanic$Pclass == 3, ])

embarked_fare_q1 <- nrow(titanic[titanic$Embarked.f == "Q" & titanic$Pclass == 1, ])
embarked_fare_q2 <- nrow(titanic[titanic$Embarked.f == "Q" & titanic$Pclass == 2, ])
embarked_fare_q3 <- nrow(titanic[titanic$Embarked.f == "Q" & titanic$Pclass == 3, ])

The results showed that 914 passengers embarked at Southampton, England, with the following class distribution:
⦁ in class 1: 177 passengers
⦁ in class 2: 242 passengers
⦁ in class 3: 495 passengers

The results showed that 270 passengers embarked at Cherbourg, France, with the following class distribution:
⦁ in class 1: 141 passengers
⦁ in class 2: 28 passengers
⦁ in class 3: 101 passengers

Finally, 123 passengers embarked at Queenstown, Ireland, with the following class distribution:
⦁ in class 1: 3 passengers
⦁ in class 2: 7 passengers
⦁ in class 3: 113 passengers
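
The individual nrow() calls above can also be read straight from a contingency table. A sketch, with the counts re-entered by hand so the block runs on its own (port_class and port_totals are illustrative names, and the two passengers with unknown port are left out):

```r
# Counts copied from the embark location by class table above
port_class <- matrix(c(141, 28, 101,
                       3, 7, 113,
                       177, 242, 495),
                     nrow = 3, byrow = TRUE,
                     dimnames = list(Port = c("C", "Q", "S"),
                                     Pclass = c("1", "2", "3")))

# Total passengers per port
port_totals <- rowSums(port_class)
port_totals

# A single cell, indexed by row name and column name
port_class["S", "3"]
```

With the real objects, the same indexing should work on embarked_fare_table, for example embarked_fare_table["S", "3"], which avoids creating a separate variable for every cell.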


CREATING PLOTS

Let’s create three plots using ggplot2. We will learn more about the capabilities of ggplot2 over this course. We will practice a few here.
The first two charts are similar – one shows the levels, and the other shows the values in percent.

Total Number of Passengers by Survival Status and Gender

Code
library(ggplot2)
ggplot(titanic, aes(x = Sex, fill = Survived.f)) +
  geom_bar(position = "dodge") +
  geom_text(aes(label = after_stat(count)), stat = "count", position = position_dodge(width = 0.9), vjust = -0.25) + #data labels
  labs(x = "Gender", y = "Count", fill = "Survival Status",
       title = "Number of Titanic Passengers by Survival and Gender",
       subtitle = "The largest group were men who did not survive")

The chart shows that among females, 81 did not survive, 233 survived, and 152 have unknown status. Among males, 468 did not survive, only 109 survived, and 266 have unknown status. In conclusion, far more males than females did not survive, and males also account for more of the unknown cases.

Percentage of Total Passengers by Survival and Gender

This is a similar chart as above, but showing the data labels as a percent of Total Passengers rather than raw values.
Note that we first calculate the percentage using dplyr then we use that data to create the chart. We also add data labels.

Code
# summarize the data to create the percentages
pip <- titanic %>%
  group_by(Sex, Survived.f) %>%
  summarise(Count = n(), .groups = 'drop') %>%
  mutate(Total = sum(Count), Percentage = (Count / Total) * 100)
pip
# A tibble: 6 × 5
  Sex    Survived.f      Count Total Percentage
  <chr>  <fct>           <int> <int>      <dbl>
1 female Did Not Survive    81  1309       6.19
2 female Survived          233  1309      17.8 
3 female Unsure            152  1309      11.6 
4 male   Did Not Survive   468  1309      35.8 
5 male   Survived          109  1309       8.33
6 male   Unsure            266  1309      20.3 
Code
# Create the dodged bar chart 
ggplot(pip, aes(x = Sex, y = Percentage, fill = Survived.f)) +
  geom_bar(stat = "identity", position = position_dodge(width = 0.9)) +
  geom_text(aes(label = sprintf("%.0f%%", Percentage), #sprintf rounds the percentage to 0 decimals
                            group = Survived.f), 
                        position = position_dodge(width = 0.9), vjust = -0.25) +
  scale_y_continuous(labels = scales::percent_format()) +
  labs(x = "Gender", y = " ", fill = "Survival Status", 
         title = "Percentage of Total Titanic Passengers by Survival and Gender",
         subtitle = "Nearly 36% of all passengers were men who did not survive") +
    theme(axis.text.y = element_blank())  # Suppress y-axis labels
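
The label text in geom_text comes from sprintf. As a quick illustration of what that format string does, here it is applied to a few percentages from the pip table (pct and pct_labels are illustrative names):

```r
# Percentages taken from the pip table above
pct <- c(6.19, 17.8, 35.8)

# "%.0f" prints a number with zero decimals; "%%" prints a literal percent sign
pct_labels <- sprintf("%.0f%%", pct)
pct_labels
```

This works because Percentage is already on a 0 to 100 scale. By contrast, scales::percent_format() multiplies by 100 by default, so it expects proportions such as 0.358 unless its scale argument is changed.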

The chart shows the percentage of survivors, non-survivors, and unknown cases among all Titanic passengers. Males who did not survive made up 36% of all passengers, and males with unknown status made up another 20%. Only 8% of all passengers were males known to have survived.

Average Age by Fare Class

Create a bar chart that shows the average age of passengers by fare class – 1st Class, 2nd Class, or 3rd Class.

Code
# Create a data frame with the average ages
library(dplyr)
mean_ages <- titanic %>%
    group_by(Pclass) %>%
    summarise(MeanAge = round(mean(Age, na.rm=TRUE), 0))

# Create the bar chart
library(ggplot2)
ggplot(data=mean_ages, aes(x=Pclass, y=MeanAge)) +
  geom_bar(stat="identity", fill="light blue", position="dodge") +
  geom_text(aes(label=round(MeanAge, 1)), vjust=-0.5) +
    ylim(0,45) +
  labs(x = "Passenger Class", y="Average Age",
       title = "Average Age by Passenger Class",
       subtitle = "First class passengers were older, on average.",
       caption = "Source: titanic dataset") 

Here we created the plot that shows the average age by fare class. First class Titanic passengers were, on average, older than those in second and third class.


YOUR TURN!

Create and interpret two charts that address embark location and fare class.
Include the intent, the chart, and the interpretation.
Be prepared to share your charts with the class.

Distribution of Passengers by Embark Location

Code
# Create a bar chart to show the distribution of passengers by embark location
ggplot(titanic, aes(x = Embarked.f, fill = factor(Pclass))) +
  geom_bar(position = "dodge") +
  geom_text(stat = "count", aes(label = after_stat(count)),  # after_stat() replaces the deprecated ..count.. notation
            position = position_dodge(width = 0.9), vjust = -0.5, color = "black") +
  labs(x = "Embark Location", y = "Count", fill = "Passenger Class",
       title = "Distribution of Titanic Passengers by Embark Location and Class",
       subtitle = "Most passengers embarked at Southampton, England") +
  scale_fill_manual(values = c("#22ab50", "#f7685e", "#2c88f2")) +  # Change colors
  scale_x_discrete(labels = c("S" = "Southampton", "C" = "Cherbourg", "Q" = "Queenstown")) +  # Change labels
  theme_minimal()

The chart shows the number of passengers who embarked at Cherbourg, Queenstown, and Southampton, by class. In all three classes, the majority of passengers embarked at Southampton: 177 in class 1, 242 in class 2, and 495 in class 3. The fewest passengers embarked at Queenstown: 3 in class 1, 7 in class 2, and 113 in class 3. Finally, at Cherbourg there were 141 passengers in class 1, 28 in class 2, and 101 in class 3.

Percentage of Passengers by Embark Location and Fare Class

Code
# Calculate percentages for each combination of embark location and fare class
embarked_class_percentage <- titanic %>%
  group_by(Embarked.f, Pclass) %>%
  summarise(Count = n(), .groups = 'drop') %>%
  mutate(Total = sum(Count), Percentage = (Count / Total) * 100)

# Create a dodged bar chart to show the percentage of passengers by embark location and fare class
ggplot(embarked_class_percentage, aes(x = Embarked.f, y = Percentage, fill = factor(Pclass))) +
  geom_bar(stat = "identity", position = position_dodge(width = 0.9)) +
  geom_text(aes(label = sprintf("%.0f%%", Percentage), group = factor(Pclass)),
            position = position_dodge(width = 0.9), vjust = -0.25) +
  labs(x = "Embark Location", y = "Percentage", fill = "Passenger Class",
       title = "Percentage of Titanic Passengers by Embark Location and Class",
       subtitle = "Third-class passengers from Southampton were the largest single group") +
  scale_y_continuous(labels = scales::percent_format(scale = 1)) +  # Percentage is already on a 0-100 scale
  scale_x_discrete(labels = c("S" = "Southampton", "C" = "Cherbourg", "Q" = "Queenstown")) +  # Change labels
  theme(axis.text.x = element_text(angle = 0, hjust = 1)) 

The chart shows the same results as the previous one, expressed as a percentage of all passengers for each combination of embark location and class. Southampton accounts for the largest share of passengers in all three classes, while Queenstown and Cherbourg account for much smaller shares.
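
The percentages in this chart are shares of all 1,309 passengers. To see the class mix within each port instead, row proportions are one option. A sketch with the counts from the earlier contingency table re-entered by hand (port_class and within_port are illustrative names; the two passengers with unknown port are left out):

```r
# Counts copied from the embark location by class table
port_class <- matrix(c(141, 28, 101,
                       3, 7, 113,
                       177, 242, 495),
                     nrow = 3, byrow = TRUE,
                     dimnames = list(Port = c("C", "Q", "S"),
                                     Pclass = c("1", "2", "3")))

# margin = 1 divides each cell by its row total, so each row sums to 100
within_port <- round(prop.table(port_class, margin = 1) * 100, 1)
within_port
```

Viewed this way, third class was the largest group at Queenstown and Southampton, while first class was the largest group at Cherbourg.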

Survival Status by Fare Class

Code
# Survival by fare class
library(dplyr)
survival_counts <- titanic %>%
  group_by(Pclass, Survived.f) %>%
  summarise(Count = n(), .groups = 'drop')

# Create the bar chart with numbers
library(ggplot2)
ggplot(data=survival_counts, aes(x=Pclass, y=Count, fill=Survived.f)) +
  geom_bar(stat="identity", position="dodge") +
  geom_text(aes(label = Count), position = position_dodge(width = 0.9), vjust = -0.5) +
  labs(x = "Passenger Class", y="Number of Passengers",
       title = "Number of Passengers by Survival Status and Passenger Class",
       subtitle = "Class 3 had the most passengers who did not survive",
       caption = "Source: Titanic dataset",
       fill = "Survival Status") +
  # values and labels follow the factor level order: Did Not Survive, Survived, Unsure
  scale_fill_manual(values=c("#f7685e", "#22ab50", "#2c88f2"), name="Survival Status",
                    labels=c("Did Not Survive", "Survived", "Unknown"))

We created a bar chart illustrating the distribution of survival status across passenger classes. In Class 1, 136 passengers survived, 80 did not survive, and 107 have unknown survival status. Class 2 shows a relatively balanced distribution: 87 survived, 97 did not survive, and 93 are unknown. In Class 3, far more passengers did not survive (372) than in the other classes, compared with 119 who survived and 218 whose status is unknown.
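
When labels are supplied to scale_fill_manual as an unnamed vector, they are applied in the order of the factor levels, and as.factor sorts levels alphabetically. A quick check on a made-up vector (status is an illustrative name) shows the order to expect:

```r
# Hypothetical survival labels in arbitrary order
status <- as.factor(c("Survived", "Unsure", "Did Not Survive", "Survived"))

# Levels are sorted alphabetically, regardless of the order in the data
levels(status)

# Counts are reported in the same level order
table(status)
```

Running levels(titanic$Survived.f) before writing the labels argument is a simple way to confirm that each label lines up with the intended category.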

Correlation Between Passenger Age and Fare Paid

Code
# Create the scatter plot
ggplot(titanic, aes(x=Age, y=Fare)) +
  geom_point(aes(color = factor(Pclass)), size=3, alpha=0.7) +
  labs(x = "Age", y="Fare",
       title = "Scatter Plot of Age vs. Fare",
       caption = "Source: Titanic dataset",
       color = "Passenger Class") +
  scale_color_manual(values=c("#1f78b4", "#33a02c", "#e31a1c"))

The scatter plot reveals distinct patterns among passenger classes. Class 1 passengers, aged 0 to 80, paid the highest fares, with some above $100 and a maximum above $500, suggesting an older demographic willing to invest in premium travel. Class 2 and 3 passengers (aged 0 to 70) predominantly paid fares below $100, indicating a younger demographic opting for more economical options.
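
A scatter plot suggests a relationship visually, and a correlation coefficient puts a number on it. A minimal sketch with made-up values (toy and age_fare_cor are illustrative names, and the numbers are not the real data), showing one way to handle missing ages:

```r
# Hypothetical ages and fares, with one missing age
toy <- data.frame(Age  = c(22, 38, NA, 35, 54, 2),
                  Fare = c(7.25, 71.28, 8.46, 53.10, 51.86, 21.08))

# use = "complete.obs" drops rows where either value is missing
age_fare_cor <- cor(toy$Age, toy$Fare, use = "complete.obs")
round(age_fare_cor, 2)
```

With the real data, the equivalent call would be cor(titanic$Age, titanic$Fare, use = "complete.obs"). A value near 0 would indicate little linear relationship between age and fare.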


END