Data Analysis Homework

Author

Dr. M. Berker Yurtseven

Titanic - EBT 555E Homework 1

Prepared by Mücahit Altaş -301241049

Deadline: 2.12.2025 (23:30)

The Titanic was a large transatlantic passenger ship that departed from Southampton on April 10, 1912 and sank in the North Atlantic after striking an iceberg on April 15. Of the 2,224 passengers and crew on board, about 1,500 lost their lives, making it one of the most tragic maritime disasters in history.

Titanic Dataset

In data science, the Titanic passenger data has become a classic introductory dataset. It contains information such as age, gender, passenger class, fare, and survival status. Because these variables are easy to understand and relate to a real-world event, the dataset is widely used to teach fundamental concepts in data analysis and classification.

The dataset gained particular popularity through Kaggle’s early competition “Titanic: Machine Learning from Disaster,” where participants build predictive models to estimate whether a passenger survived. Today it remains a common benchmark for learning data preprocessing, exploratory analysis, and basic machine learning techniques.

Variable Descriptions

Variable	Description
Survived	Survival indicator (0 = did not survive, 1 = survived).
Pclass	Passenger class (1 = first, 2 = second, 3 = third).
Sex	Sex of the passenger (male/female).
Age	Age in years (may include missing values).
SiblingsSpouses	Number of siblings or spouses traveling with the passenger.
ParentsChildren	Number of parents or children traveling with the passenger.
Fare	Ticket fare paid by the passenger.

Using the provided titanic.csv file, answer the 10 questions below.
For each question, write your R code immediately under the question inside an R code chunk ({r}), and make sure that your code runs without errors and produces the requested output.

library(readr)
library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ purrr     1.1.0
✔ forcats   1.0.1     ✔ stringr   1.5.2
✔ ggplot2   4.0.0     ✔ tibble    3.3.0
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(dplyr)
library(gt)
titanic <- read_csv("/cloud/project/HW1/titanic.csv")

Rows: 887 Columns: 7
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): Sex
dbl (6): Survived, Pclass, Age, SiblingsSpouses, ParentsChildren, Fare

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

titanic

# A tibble: 887 × 7
   Survived Pclass Sex      Age SiblingsSpouses ParentsChildren  Fare
      <dbl>  <dbl> <chr>  <dbl>           <dbl>           <dbl> <dbl>
 1        0      3 male      22               1               0  7.25
 2        1      1 female    38               1               0 71.3 
 3        1      3 female    26               0               0  7.92
 4        1      1 female    35               1               0 53.1 
 5        0      3 male      35               0               0  8.05
 6        0      3 male      27               0               0  8.46
 7        0      1 male      54               0               0 51.9 
 8        0      3 male       2               3               1 21.1 
 9        1      3 female    27               0               2 11.1 
10        1      2 female    14               1               0 30.1 
# ℹ 877 more rows
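The column-type message printed by read_csv() can be avoided by declaring the types up front. The following is a minimal sketch, assuming readr 2.0 or later (for literal input via I()). It uses a hypothetical three-row sample, csv_text, in place of titanic.csv, and the integer types for the count columns are a choice made here, not something the assignment requires.

```r
library(readr)

# Hypothetical three-row sample laid out like titanic.csv (illustrative values)
csv_text <- I("Survived,Pclass,Sex,Age,SiblingsSpouses,ParentsChildren,Fare
0,3,male,22,1,0,7.25
1,1,female,38,1,0,71.2833
1,3,female,26,0,0,7.925
")

# Declaring col_types silences the message and pins each column's type
sample_titanic <- read_csv(
  csv_text,
  col_types = cols(
    Survived        = col_integer(),
    Pclass          = col_integer(),
    Sex             = col_character(),
    Age             = col_double(),
    SiblingsSpouses = col_integer(),
    ParentsChildren = col_integer(),
    Fare            = col_double()
  )
)

sample_titanic
```

Passing show_col_types = FALSE instead would only hide the message and keep readr's guessed types.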

Question 1: Compute the summary statistics (mean, median, min, max, sd) of the Age variable in the dataset.

titanic%>% 
  summarise(
    mean_Age=mean(Age, na.rm=TRUE),
    median_Age=median(Age, na.rm=TRUE),
    min_Age=min(Age, na.rm=TRUE),
    max_Age=max(Age, na.rm=TRUE),
    sd_Age=sd(Age, na.rm=TRUE),
    )%>%
  gt()%>%
  tab_header(title="Summary of Age Dataset",
             subtitle = "Source: Titanic Dataset")

Summary of Age Dataset
Source: Titanic Dataset
mean_Age	median_Age	min_Age	max_Age	sd_Age
29.47144	28	0.42	80	14.12191
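Age is documented as possibly containing missing values, which is why each statistic in Question 1 is computed with na.rm = TRUE. A small sketch with a made-up vector (ages is hypothetical, not taken from the dataset) shows what the argument changes:

```r
# Hypothetical ages with one missing value
ages <- c(22, NA, 38)

mean_with_na    <- mean(ages)                # NA propagates through the mean
mean_without_na <- mean(ages, na.rm = TRUE)  # NA is dropped before averaging

print(mean_with_na)     # NA
print(mean_without_na)  # 30
```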

Question 2: Group the data by Sex and calculate the survival rate for each group.

titanic%>%
  group_by(Sex)%>%
  summarise(
    survival_rate=mean(Survived)
  )%>%
  gt()%>%
  tab_header(title="Survival Rate by Sex",
             subtitle = "Source: Titanic Dataset")%>%
  fmt_number(columns=survival_rate, decimals = 2)

Survival Rate by Sex
Source: Titanic Dataset
Sex	survival_rate
female	0.74
male	0.19
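The survival rate is obtained with mean(Survived) because the mean of a 0/1 indicator equals the share of ones. A quick check on a made-up vector (survived is hypothetical) illustrates this:

```r
# Hypothetical survival indicators: 3 of 5 survived
survived <- c(1, 0, 0, 1, 1)

rate_via_mean  <- mean(survived)
rate_via_count <- sum(survived == 1) / length(survived)

print(rate_via_mean)   # 0.6
print(rate_via_count)  # 0.6
```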

Question 3: Compute the average age for each Pclass, and arrange the table in an ascending order.

titanic%>%
  group_by(Pclass)%>%
  summarise(average_age=mean(Age, na.rm=TRUE))%>%
  arrange(average_age)%>%
  gt()%>%
  tab_header(title="Average Age for Pclass",
             subtitle = "Source: Titanic Dataset")%>%
  fmt_number(columns=average_age, decimals = 2)

Average Age for Pclass
Source: Titanic Dataset
Pclass	average_age
3	25.19
2	29.87
1	38.79

Question 4: Draw three histograms using facet_wrap or facet_grid to show the distribution of fares across the passenger classes (i.e., adjust bins or binwidth and the x breaks accordingly).

titanic%>%
  ggplot(aes(x=Fare))+
  geom_histogram(binwidth=25, color="white", boundary=0)+
  facet_wrap(~Pclass)+
  labs(title="Distribution of fares between different passenger classes",
       subtitle="Titanic Dataset",
       x="Fare",
       y="Count"
       )
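Question 4 also asks for x breaks that suit the data. One possible variation is sketched below on a hypothetical nine-row data frame (fares) rather than the real data. It lets each panel keep its own x range with scales = "free_x" and sets explicit breaks. The binwidth of 10 and the break spacing of 20 are illustrative values only.

```r
library(ggplot2)

# Hypothetical fares, three per passenger class
fares <- data.frame(
  Pclass = rep(c(1, 2, 3), each = 3),
  Fare   = c(30, 80, 150, 10, 13, 26, 7, 8, 15)
)

# Explicit tick positions for the x axis
fare_breaks <- seq(0, 160, by = 20)

p <- ggplot(fares, aes(x = Fare)) +
  geom_histogram(binwidth = 10, boundary = 0, color = "white") +
  scale_x_continuous(breaks = fare_breaks) +
  facet_wrap(~Pclass, scales = "free_x") +  # each class gets its own x range
  labs(x = "Fare", y = "Count")

p
```

With a shared x axis, the narrow third-class fares are squeezed into the first few bins. Free x scales trade direct comparability between panels for more detail within each one.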

Question 5: Show the same distribution with the boxplot (without faceting)

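titanic%>%
  # Map fill to the factor so each class gets a discrete colour instead of a continuous gradient
  ggplot(aes(x=factor(Pclass),y=Fare,fill=factor(Pclass)))+
  geom_boxplot()+
  labs(title="Distribution of fares between different passenger classes",
       subtitle="Titanic Dataset",
       x="Passenger Class",
       y="Fare",
       fill="Passenger Class"
       )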

Question 6: Filter the dataset to include only survivors and identify both the youngest and the oldest surviving passenger.

titanic%>%
  filter(Survived==1)%>%
  summarise(
    youngest=min(Age, na.rm=TRUE),
    oldest=max(Age, na.rm=TRUE)
  )%>%
  gt()%>%
  tab_header(title="Youngest & Oldest Survivors",
             subtitle = "Source: Titanic Dataset")

Youngest & Oldest Survivors
Source: Titanic Dataset
youngest	oldest
0.42	80
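The Question 6 summary reports only the extreme ages. If the full records of those passengers are wanted, slice_min() and slice_max() from dplyr (1.0.0 or later) return the matching rows. Here is a sketch on a hypothetical five-row tibble (passengers), not the real data:

```r
library(dplyr)

# Hypothetical passengers; only the columns needed for the illustration
passengers <- tibble(
  Survived = c(1, 0, 1, 1, 0),
  Sex      = c("female", "male", "male", "female", "male"),
  Age      = c(38, 22, 4, 63, 71)
)

survivors <- passengers %>% filter(Survived == 1)

# Rows with the smallest and largest Age among survivors
youngest <- survivors %>% slice_min(Age, n = 1)
oldest   <- survivors %>% slice_max(Age, n = 1)

# Stack both rows and label which is which
bind_rows(youngest = youngest, oldest = oldest, .id = "who")
```

Both functions keep ties by default, so several rows can come back if passengers share the same extreme age.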

Question 7: Draw a scatter plot of Age (x) versus Fare (y), coloring points by Sex.

titanic%>%
  ggplot(aes(x=Age,y=Fare,color=Sex))+
  geom_point()+
  # labs() must be chained with + so the labels are applied to the plot
  labs(title="Age Vs Fare",
       subtitle="Titanic Dataset",
       x="Age",
       y="Fare"
       )

Question 8: Create a new variable FamilySize = SiblingsSpouses + ParentsChildren, and create a boxplot showing FamilySize for each Pclass.

titanic%>%
  mutate(FamilySize = SiblingsSpouses + ParentsChildren+1)%>% 
  # I added 1 so that FamilySize also counts the passenger; otherwise a passenger traveling alone would have a family size of 0.
  ggplot(aes(x=as.factor(Pclass),y=FamilySize))+
  geom_boxplot()+
  labs(title="FamilySize & Pclass",
       subtitle="Titanic Dataset",
       x="Pclass",
       y="FamilySize"
       )

Question 9: Calculate mean and median family size for each Pclass

titanic%>%
  mutate(FamilySize = SiblingsSpouses + ParentsChildren+1)%>%
  # I added 1 so that FamilySize also counts the passenger; otherwise a passenger traveling alone would have a family size of 0.
  group_by(Pclass)%>%
  summarise(
    mean_FamilySize=mean(FamilySize),
    median_FamilySize=median(FamilySize)
  )

# A tibble: 3 × 3
  Pclass mean_FamilySize median_FamilySize
   <dbl>           <dbl>             <dbl>
1      1            1.77                 1
2      2            1.78                 1
3      3            2.02                 1

Question 10: Count the number of people with zero family size

# For this question FamilySize is computed without the +1, so a value of 0 means the passenger traveled alone.
titanic%>%
  mutate(FamilySize = SiblingsSpouses + ParentsChildren)%>%
  filter(FamilySize==0)%>%
  nrow()

[1] 533
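Questions 8 and 9 add 1 to FamilySize while Question 10 does not, so a family size of 0 under one definition corresponds to a family size of 1 under the other. A sketch on a hypothetical five-row tibble (companions) shows that both definitions count the same passengers as traveling alone. The name FamilySizeSelf is introduced here only for illustration.

```r
library(dplyr)

# Hypothetical companion counts for five passengers
companions <- tibble(
  SiblingsSpouses = c(0, 1, 0, 3, 0),
  ParentsChildren = c(0, 0, 2, 1, 0)
)

counts <- companions %>%
  mutate(
    FamilySize     = SiblingsSpouses + ParentsChildren,  # definition given in Question 8
    FamilySizeSelf = FamilySize + 1                      # variant that also counts the passenger
  ) %>%
  summarise(
    alone_zero = sum(FamilySize == 0),      # alone under the first definition
    alone_one  = sum(FamilySizeSelf == 1)   # alone under the second definition
  )

counts
```

Using sum() on the logical condition gives the same count as filter() followed by nrow(), without building the filtered table.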