Titanic Data

Load the data

rm(list=ls())
gc()
##           used (Mb) gc trigger (Mb) limit (Mb) max used (Mb)
## Ncells  592286 31.7    1357534 72.6         NA   669422 35.8
## Vcells 1115978  8.6    8388608 64.0      16384  1851681 14.2
directory <- "/Users/ruthiemaurer/Desktop/DATA 712"
setwd(directory)

library(readxl)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(tidyr)
library(ggplot2)
library(haven)
library(tidyverse)
## Warning: package 'purrr' was built under R version 4.3.3
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats   1.0.0     ✔ readr     2.1.5
## ✔ lubridate 1.9.3     ✔ stringr   1.5.1
## ✔ purrr     1.0.4     ✔ tibble    3.2.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
set.seed(123123)
DATA <- read_xlsx("titanic_data.xlsx", col_names = TRUE)
head(DATA)
## # A tibble: 6 × 12
##   PassengerId Survived Pclass Name    Sex     Age SibSp Parch Ticket  Fare Cabin
##         <dbl>    <dbl>  <dbl> <chr>   <chr> <dbl> <dbl> <dbl> <chr>  <dbl> <chr>
## 1           1        0      3 Braund… male     22     1     0 A/5 2…  7.25 <NA> 
## 2           2        1      1 Cuming… fema…    38     1     0 PC 17… 71.3  C85  
## 3           3        1      3 Heikki… fema…    26     0     0 STON/…  7.92 <NA> 
## 4           4        1      1 Futrel… fema…    35     1     0 113803 53.1  C123 
## 5           5        0      3 Allen,… male     35     0     0 373450  8.05 <NA> 
## 6           6        0      3 Moran,… male     NA     0     0 330877  8.46 <NA> 
## # ℹ 1 more variable: Embarked <chr>
names(DATA)
##  [1] "PassengerId" "Survived"    "Pclass"      "Name"        "Sex"        
##  [6] "Age"         "SibSp"       "Parch"       "Ticket"      "Fare"       
## [11] "Cabin"       "Embarked"
fare_sex <- DATA %>%
  group_by(Sex) %>%
  summarise(Average_Fare = mean(Fare, na.rm = TRUE))

print(fare_sex)
## # A tibble: 2 × 2
##   Sex    Average_Fare
##   <chr>         <dbl>
## 1 female         44.5
## 2 male           25.5

Analysis

The Titanic dataset shows large disparities in ticket prices and survival rates by gender and passenger class. On average, women paid higher fares than men in every passenger class. The average fare for female passengers was $44.48, compared with $25.52 for male passengers. The difference was most pronounced in first class, where women paid an average of $106.13 versus $67.23 for men. Third-class passengers of both sexes paid the lowest fares, with men averaging $12.66 and women roughly $16.1. The gap in fares between first-class and third-class passengers underscores the socioeconomic divide aboard the Titanic. These are descriptive averages; no statistical test was run on the differences.
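Group means can be driven by a few very expensive tickets, so it may help to report the group size and the median next to each average. The following is a minimal sketch under stated assumptions: `toy` is a small made-up data frame that only borrows the column names `Sex`, `Pclass`, and `Fare` from `DATA`, and its values are invented for illustration rather than taken from the Titanic file.

```r
library(dplyr)

# Hypothetical toy data: same column names as DATA, invented values
toy <- data.frame(
  Sex    = c("female", "female", "female", "male", "male", "male"),
  Pclass = c(1, 3, 3, 1, 3, 3),
  Fare   = c(100, 10, 16, 60, 8, 10)
)

# Report group size, mean, and median fare for each sex
fare_summary <- toy %>%
  group_by(Sex) %>%
  summarise(
    N            = n(),
    Average_Fare = mean(Fare, na.rm = TRUE),
    Median_Fare  = median(Fare, na.rm = TRUE),
    .groups      = "drop"
  )

print(fare_summary)
```

In the toy data the single first-class ticket pulls each mean well above the median, which is the kind of effect the extra columns are meant to expose. Swapping `toy` for `DATA` would apply the same summary to the real file.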

# Compare average Fare by Sex and Pclass
fare_sex_pclass <- DATA %>%
  group_by(Sex, Pclass) %>%
  summarise(Average_Fare = mean(Fare, na.rm = TRUE))
## `summarise()` has grouped output by 'Sex'. You can override using the `.groups`
## argument.
# Compare average Survival rate by Sex and Pclass
survival_sex_pclass <- DATA %>%
  group_by(Sex, Pclass) %>%
  summarise(Average_Survival = mean(Survived, na.rm = TRUE))
## `summarise()` has grouped output by 'Sex'. You can override using the `.groups`
## argument.
# Compare average Survival by Sex
survival_sex <- DATA %>%
  group_by(Sex) %>%
  summarise(Average_Survival = mean(Survived, na.rm = TRUE)) 

print(fare_sex_pclass)
## # A tibble: 6 × 3
## # Groups:   Sex [2]
##   Sex    Pclass Average_Fare
##   <chr>   <dbl>        <dbl>
## 1 female      1        106. 
## 2 female      2         22.0
## 3 female      3         16.1
## 4 male        1         67.2
## 5 male        2         19.7
## 6 male        3         12.7
print(survival_sex_pclass)
## # A tibble: 6 × 3
## # Groups:   Sex [2]
##   Sex    Pclass Average_Survival
##   <chr>   <dbl>            <dbl>
## 1 female      1            0.968
## 2 female      2            0.921
## 3 female      3            0.5  
## 4 male        1            0.369
## 5 male        2            0.157
## 6 male        3            0.135
print(survival_sex)
## # A tibble: 2 × 2
##   Sex    Average_Survival
##   <chr>             <dbl>
## 1 female            0.742
## 2 male              0.189
survival_pclass <- DATA %>%
  group_by(Pclass) %>%
  summarise(Average_Survival = mean(Survived, na.rm = TRUE))

print(survival_pclass)
## # A tibble: 3 × 2
##   Pclass Average_Survival
##    <dbl>            <dbl>
## 1      1            0.630
## 2      2            0.473
## 3      3            0.242

Survival rates also differed substantially by gender and passenger class. Overall, 74.2% of female passengers survived, compared with only 18.9% of male passengers. The “women and children first” evacuation policy may explain much of this gap. By passenger class, first-class passengers had the highest survival rate at 63%, followed by second-class passengers at 47.3% and third-class passengers at 24.2%. Third-class male passengers fared the worst of all groups, with a survival rate of just 13.5%, which suggests that class and gender together played a major role in determining who lived and who did not.
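Because `Survived` is coded 0/1, its mean within a group is the share of survivors in that group. The sketch below makes that explicit by printing the group size and survivor count next to each rate. It assumes a made-up data frame `toy` that reuses the column names `Sex`, `Pclass`, and `Survived`; the rows are invented for illustration and are not Titanic records.

```r
library(dplyr)

# Hypothetical toy data: same column names as DATA, invented values
toy <- data.frame(
  Sex      = c("female", "female", "female", "female",
               "male", "male", "male", "male"),
  Pclass   = c(1, 1, 3, 3, 1, 1, 3, 3),
  Survived = c(1, 1, 1, 0, 1, 0, 0, 0)
)

# For a 0/1 indicator, mean() equals survivors divided by group size
survival_counts <- toy %>%
  group_by(Sex, Pclass) %>%
  summarise(
    N             = n(),
    Survivors     = sum(Survived, na.rm = TRUE),
    Survival_Rate = mean(Survived, na.rm = TRUE),
    .groups       = "drop"
  )

print(survival_counts)
```

Showing `N` alongside the rate also makes it easier to judge how much weight a rate from a small group deserves.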

These statistics highlight the inequalities aboard the Titanic. Women and first-class passengers had a much better chance of survival, possibly because of their proximity to lifeboats and better access to evacuation resources and procedures, while those in third class faced greater barriers to safety. The data alone cannot establish these causes, but the pattern underscores how strongly social and economic factors were tied to the fate of the Titanic’s passengers.

Built-in Dataset: mpg

The mpg dataset, which ships with the ggplot2 package, contains fuel economy data from 1999 to 2008 for 38 popular car models. The data come from fueleconomy.gov and include only models that had a new release every year in that period, which serves as a proxy for popularity. The variables are described on the dataset's help page (?mpg).

Using this dataset, I want to see which car manufacturer has the best average city and highway miles per gallon (MPG). I also want to compare the vehicles by fuel type to see which fuel type performs best.

Load the data

data("mpg")
head(mpg)
## # A tibble: 6 × 11
##   manufacturer model displ  year   cyl trans      drv     cty   hwy fl    class 
##   <chr>        <chr> <dbl> <int> <int> <chr>      <chr> <int> <int> <chr> <chr> 
## 1 audi         a4      1.8  1999     4 auto(l5)   f        18    29 p     compa…
## 2 audi         a4      1.8  1999     4 manual(m5) f        21    29 p     compa…
## 3 audi         a4      2    2008     4 manual(m6) f        20    31 p     compa…
## 4 audi         a4      2    2008     4 auto(av)   f        21    30 p     compa…
## 5 audi         a4      2.8  1999     6 auto(l5)   f        16    26 p     compa…
## 6 audi         a4      2.8  1999     6 manual(m5) f        18    26 p     compa…
names(mpg)
##  [1] "manufacturer" "model"        "displ"        "year"         "cyl"         
##  [6] "trans"        "drv"          "cty"          "hwy"          "fl"          
## [11] "class"
colnames(mpg) <- c("Manufacturer", "Model", "Displacement", "Year", "Cylinder", "Transmission", "Drive Train", "City MPG", "Highway MPG", "Fuel Type", "Type of Car")
colnames(mpg)
##  [1] "Manufacturer" "Model"        "Displacement" "Year"         "Cylinder"    
##  [6] "Transmission" "Drive Train"  "City MPG"     "Highway MPG"  "Fuel Type"   
## [11] "Type of Car"
print(mpg)
## # A tibble: 234 × 11
##    Manufacturer Model     Displacement  Year Cylinder Transmission `Drive Train`
##    <chr>        <chr>            <dbl> <int>    <int> <chr>        <chr>        
##  1 audi         a4                 1.8  1999        4 auto(l5)     f            
##  2 audi         a4                 1.8  1999        4 manual(m5)   f            
##  3 audi         a4                 2    2008        4 manual(m6)   f            
##  4 audi         a4                 2    2008        4 auto(av)     f            
##  5 audi         a4                 2.8  1999        6 auto(l5)     f            
##  6 audi         a4                 2.8  1999        6 manual(m5)   f            
##  7 audi         a4                 3.1  2008        6 auto(av)     f            
##  8 audi         a4 quatt…          1.8  1999        4 manual(m5)   4            
##  9 audi         a4 quatt…          1.8  1999        4 auto(l5)     4            
## 10 audi         a4 quatt…          2    2008        4 manual(m6)   4            
## # ℹ 224 more rows
## # ℹ 4 more variables: `City MPG` <int>, `Highway MPG` <int>, `Fuel Type` <chr>,
## #   `Type of Car` <chr>
# Compute average City MPG and Highway MPG by Manufacturer
mpg_by_manufacturer <- mpg %>%
  group_by(Manufacturer) %>%
  summarise(Average_City_MPG = mean(`City MPG`, na.rm = TRUE),
            Average_Highway_MPG = mean(`Highway MPG`, na.rm = TRUE))

mpg_by_fuel <- mpg %>%
  group_by(`Fuel Type`) %>%
  summarise(Average_City_MPG = mean(`City MPG`, na.rm = TRUE),
            Average_Highway_MPG = mean(`Highway MPG`, na.rm = TRUE))

print(mpg_by_manufacturer)
## # A tibble: 15 × 3
##    Manufacturer Average_City_MPG Average_Highway_MPG
##    <chr>                   <dbl>               <dbl>
##  1 audi                     17.6                26.4
##  2 chevrolet                15                  21.9
##  3 dodge                    13.1                17.9
##  4 ford                     14                  19.4
##  5 honda                    24.4                32.6
##  6 hyundai                  18.6                26.9
##  7 jeep                     13.5                17.6
##  8 land rover               11.5                16.5
##  9 lincoln                  11.3                17  
## 10 mercury                  13.2                18  
## 11 nissan                   18.1                24.6
## 12 pontiac                  17                  26.4
## 13 subaru                   19.3                25.6
## 14 toyota                   18.5                24.9
## 15 volkswagen               20.9                29.2
print(mpg_by_fuel)
## # A tibble: 5 × 3
##   `Fuel Type` Average_City_MPG Average_Highway_MPG
##   <chr>                  <dbl>               <dbl>
## 1 c                      24                   36  
## 2 d                      25.6                 33.6
## 3 e                       9.75                13.2
## 4 p                      17.4                 25.2
## 5 r                      16.7                 23.0
# Find the manufacturer with the best City MPG
best_city_mpg <- mpg_by_manufacturer %>%
  filter(Average_City_MPG == max(mpg_by_manufacturer$Average_City_MPG))

# Find the manufacturer with the best Highway MPG
best_highway_mpg <- mpg_by_manufacturer %>%
  filter(Average_Highway_MPG == max(mpg_by_manufacturer$Average_Highway_MPG))

print(best_city_mpg)
## # A tibble: 1 × 3
##   Manufacturer Average_City_MPG Average_Highway_MPG
##   <chr>                   <dbl>               <dbl>
## 1 honda                    24.4                32.6
print(best_highway_mpg)
## # A tibble: 1 × 3
##   Manufacturer Average_City_MPG Average_Highway_MPG
##   <chr>                   <dbl>               <dbl>
## 1 honda                    24.4                32.6

Analysis

The data show that Honda had the best average city and highway MPG of any manufacturer. Honda’s city MPG averaged 24.44 miles per gallon, the highest in the dataset for urban driving. Its highway MPG averaged 32.56 miles per gallon, also the highest, which makes it the most fuel-efficient choice for long-distance travel as well.
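The same ranking can be read off a sorted table, which shows the whole ordering and not only the top entry. This is a minimal sketch under stated assumptions: it reads `ggplot2::mpg` directly with its original column names (`manufacturer`, `cty`, `hwy`) instead of the renamed copy used in this report, and the names `ranked`, `top_city`, and `bottom_city` are illustrative.

```r
library(dplyr)

# Average city and highway MPG per manufacturer, best city MPG first
ranked <- ggplot2::mpg %>%
  group_by(manufacturer) %>%
  summarise(
    Average_City_MPG    = mean(cty),
    Average_Highway_MPG = mean(hwy),
    .groups             = "drop"
  ) %>%
  arrange(desc(Average_City_MPG))

# slice_max()/slice_min() keep ties by default, like filter(x == max(x))
top_city    <- slice_max(ranked, Average_City_MPG, n = 1)
bottom_city <- slice_min(ranked, Average_City_MPG, n = 1)

print(ranked)
print(top_city)
print(bottom_city)
```

Using `slice_max()` avoids referring back to the summary table inside `filter()`, and raising `n` returns the top few manufacturers in one call.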

Volkswagen and Subaru also performed well. Volkswagen achieved 20.92 MPG in the city and 29.22 MPG on the highway, while Subaru recorded 19.28 MPG in the city and 25.57 MPG on the highway.

In contrast, the least fuel-efficient manufacturers were Land Rover, with 11.5 MPG city and 16.5 MPG highway, and Lincoln, with 11.33 MPG city and 17 MPG highway, likely because of the larger SUVs in their lineups. The bar charts below illustrate these findings.

# Visualization: City vs. Highway MPG by Manufacturer with custom colors
# The two layers are overlaid rather than dodged: the semi-transparent
# highway bar is drawn on top of the city bar for each manufacturer
ggplot(mpg_by_manufacturer, aes(x = Manufacturer)) +
  geom_col(aes(y = Average_City_MPG, fill = "City MPG")) +
  geom_col(aes(y = Average_Highway_MPG, fill = "Highway MPG"), alpha = 0.7) +
  scale_fill_manual(values = c("City MPG" = "blue", "Highway MPG" = "lightblue")) +
  labs(title = "Average City & Highway MPG by Manufacturer", y = "Miles Per Gallon", fill = "MPG Type") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))  # tilt labels so they do not overlap

# Visualization: City vs. Highway MPG by Fuel Type with custom colors
# As above, the highway bar is overlaid on the city bar for each fuel type
ggplot(mpg_by_fuel, aes(x = `Fuel Type`)) +
  geom_col(aes(y = Average_City_MPG, fill = "City MPG")) +
  geom_col(aes(y = Average_Highway_MPG, fill = "Highway MPG"), alpha = 0.7) +
  scale_fill_manual(values = c("City MPG" = "darkred", "Highway MPG" = "firebrick")) +
  labs(title = "Average City & Highway MPG by Fuel Type", y = "Miles Per Gallon", fill = "MPG Type") +
  theme_minimal()

Among the fuel types, diesel vehicles (d) had the best city average at 25.6 MPG, with 33.6 MPG on the highway. Compressed natural gas (c) vehicles followed closely in the city at 24 MPG and had the best highway average at 36 MPG, making them a good choice for long-distance driving. Premium gasoline (p) vehicles averaged 17.36 MPG in the city and 25.23 MPG on the highway, and regular gasoline (r) vehicles averaged 16.7 and 23.0 MPG, so both were less fuel-efficient than diesel. Vehicles coded e had the lowest averages, at 9.75 MPG in the city and 13.25 MPG on the highway. The dataset's documentation does not spell out the fuel codes; the readings used here are the usual interpretation, under which e denotes ethanol (E85) rather than electric.
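Averages by fuel type are easier to interpret when the number of vehicles behind each one is visible, since a group with only a handful of vehicles should be read cautiously. The sketch below adds a count column and a readable label to the fuel summary. It reads `ggplot2::mpg` with its original column names (`fl`, `cty`, `hwy`), and the `fuel_labels` lookup is an assumption: `?mpg` does not define the codes, so the labels follow the common interpretation.

```r
library(dplyr)

# Assumed meanings of the fuel codes; ?mpg does not define them
fuel_labels <- c(c = "CNG", d = "diesel", e = "ethanol (E85)",
                 p = "premium", r = "regular")

# Count vehicles and average MPG per fuel code, then attach the label
fuel_summary <- ggplot2::mpg %>%
  group_by(fl) %>%
  summarise(
    N                   = n(),
    Average_City_MPG    = mean(cty),
    Average_Highway_MPG = mean(hwy),
    .groups             = "drop"
  ) %>%
  mutate(Fuel_Label = unname(fuel_labels[fl]))

print(fuel_summary)
```

If the labels turn out not to match a given source, only the `fuel_labels` vector needs to change.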

With these results, a buyer can make a more informed decision about which car to purchase based on their needs and the car’s fuel economy.