Coordinating Instructions

Background The Spaceship Titanic was an interstellar passenger liner launched a month ago. With almost 13,000 passengers on board, the vessel set out on its maiden voyage transporting emigrants from our solar system to three newly habitable exoplanets orbiting nearby stars. While rounding Alpha Centauri en route to its first destination—the torrid 55 Cancri E—the unwary Spaceship Titanic collided with a spacetime anomaly hidden within a dust cloud. Sadly, it met a similar fate as its namesake from 1000 years before. Though the ship stayed intact, almost half of the passengers were transported to an alternate dimension!

Objective To help rescue crews and retrieve the lost passengers, you are challenged to predict which passengers were transported by the anomaly using records recovered from the spaceship’s damaged computer system. Help save them and change history!

NAs I mostly ignored them, and did not attempt to discover what information the NA data might disclose upon investigation. That would’ve take too long. ### Packages - tidyverse - stringr - scales - knitr

Available Data

  • All datasets obtained from Kaggle

  • The train_df dataset provides personal records for about 8700 passengers whose teleportation status is known.

  • The test_df dataset provides personal records for the remaining ~4300 passengers whose teleportation status is unknown. The key task is to project which of these passengers was likely teleported.

test_df %>% 
  head(n = 10) %>% 
  kable()
PassengerId HomePlanet CryoSleep Cabin Destination Age VIP RoomService FoodCourt ShoppingMall Spa VRDeck Name
0013_01 Earth TRUE G/3/S TRAPPIST-1e 27 FALSE 0 0 0 0 0 Nelly Carsoning
0018_01 Earth FALSE F/4/S TRAPPIST-1e 19 FALSE 0 9 0 2823 0 Lerome Peckers
0019_01 Europa TRUE C/0/S 55 Cancri e 31 FALSE 0 0 0 0 0 Sabih Unhearfus
0021_01 Europa FALSE C/1/S TRAPPIST-1e 38 FALSE 0 6652 0 181 585 Meratz Caltilter
0023_01 Earth FALSE F/5/S TRAPPIST-1e 20 FALSE 10 0 635 0 0 Brence Harperez
0027_01 Earth FALSE F/7/P TRAPPIST-1e 31 FALSE 0 1615 263 113 60 Karlen Ricks
0029_01 Europa TRUE B/2/P 55 Cancri e 21 FALSE 0 NA 0 0 0 Aldah Ainserfle
0032_01 Europa TRUE D/0/S TRAPPIST-1e 20 FALSE 0 0 0 0 0 Acrabi Pringry
0032_02 Europa TRUE D/0/S 55 Cancri e 23 FALSE 0 0 0 0 0 Dhena Pringry
0033_01 Earth FALSE F/7/S 55 Cancri e 24 FALSE 0 639 0 0 0 Eliana Delazarson

Key Terms

  • PassengerId: A unique Id for each passenger. Each Id takes the form gggg_pp where gggg indicates a group the passenger is travelling with and pp is their number within the group. People in a group are often family members, but not always.

  • HomePlanet: The planet the passenger departed from, typically their planet of permanent residence.

  • CryoSleep: Indicates whether the passenger elected to be put into suspended animation for the duration of the voyage. Passengers in cryosleep are confined to their cabins.

  • Cabin: The cabin number where the passenger is staying. Takes the form deck/num/side, where side can be either P for Port or S for Starboard.

  • Destination: The planet the passenger will be debarking to.

  • Age: The age of the passenger.

  • VIP: Whether the passenger has paid for special VIP service during the voyage.

  • RoomService, FoodCourt, ShoppingMall, Spa, VRDeck: Amount the passenger has billed at each of the Spaceship Titanic’s many luxury amenities.

  • Name: The first and last names of the passenger.

  • Transported: Whether the passenger was transported to another dimension. This is the target, the column you are trying to predict.

  • PassengerId: Id for each passenger in the test set.

  • Transported: The target. For each passenger, predict either True or False. (I refer to this as “teleported” throughout this project, also I only predict likelihood)

EDA and Data Cleaning

Addressing Problematic NAs

Addresing the NAs in the categories for amounts billed will help with some of the math later on. We address each dataframes separately to keep them uniform

test_df <- test_df %>% 
  mutate(RoomService = replace_na(test_df$RoomService, 0),
         FoodCourt = replace_na(test_df$FoodCourt, 0),
         ShoppingMall = replace_na(test_df$ShoppingMall, 0),
         Spa = replace_na(test_df$Spa, 0),
         VRDeck = replace_na(test_df$VRDeck))
train_df <- train_df %>% 
  mutate(RoomService = replace_na(train_df$RoomService, 0),
         FoodCourt = replace_na(train_df$FoodCourt, 0),
         ShoppingMall = replace_na(train_df$ShoppingMall, 0),
         Spa = replace_na(train_df$Spa, 0),
         VRDeck = replace_na(train_df$VRDeck))

Separating the Group number out of the PassengerId

This operation will allow us to group by group number should we need to later on; just seems like something we’ll need. I ended up forgetting to use this to determine likelihood of teleportation by assigned group number or passenger number. Oops.

# first make a Tibble that just separates the PassengerId into Group# and Passenger#
train_group_numbers <- as.tibble(str_split(string = train_df$PassengerId, pattern ="_", simplify = T))
test_group_numbers <- as.tibble(str_split(string = test_df$PassengerId, pattern ="_", simplify = T))
#change column names
colnames(test_group_numbers)[1] <- "GroupNum"
colnames(train_group_numbers)[1] <- "GroupNum"
colnames(test_group_numbers)[2] <- "PsngrNum"
colnames(train_group_numbers)[2] <- "PsngrNum"
#Add these columns to the original dataframes
test_df <- test_df %>% 
  mutate(GroupNum = test_group_numbers$GroupNum,
         PsngrNum = test_group_numbers$PsngrNum)
train_df <- train_df %>% 
  mutate(GroupNum = train_group_numbers$GroupNum,
         PsngrNum = train_group_numbers$PsngrNum)
# reorder to put the new columns by the PassengerId
test_df <- test_df %>% 
  select(PassengerId, GroupNum, PsngrNum, Name, HomePlanet:VRDeck)
train_df <- train_df %>% 
  select(PassengerId, GroupNum, PsngrNum, Name, Transported, HomePlanet:VRDeck)

Separating the data within the Cabin vector

This operation will parse out deck assignment and ship side assignment from the Cabin vector. Perhaps where the passenger was sleeping (if they were) made a difference in their likelihood of being teleported

# first make a Tibble that just separates the Cabin assignments into Deck/Number/Side
train_cabin_assignments <- as.tibble(str_split(string = train_df$Cabin, pattern = "/", simplify = T))
test_cabin_assignments <- as.tibble(str_split(string = test_df$Cabin, pattern = "/", simplify = T))
#change column names
colnames(test_cabin_assignments)[1] <- "Deck"
colnames(train_cabin_assignments)[1] <- "Deck"
colnames(test_cabin_assignments)[2] <- "CabinNumber"
colnames(train_cabin_assignments)[2] <- "CabinNumber"
colnames(test_cabin_assignments)[3] <- "CabinSide"
colnames(train_cabin_assignments)[3] <- "CabinSide"
#Add these columns to the original dataframes
test_df <- test_df %>% 
  mutate(Deck = test_cabin_assignments$Deck,
         CabinNumber = test_cabin_assignments$CabinNumber,
         CabinSide = test_cabin_assignments$CabinSide)
train_df <- train_df %>% 
  mutate(Deck = train_cabin_assignments$Deck,
         CabinNumber = train_cabin_assignments$CabinNumber,
         CabinSide = train_cabin_assignments$CabinSide)
# reorder to put the new columns by the PassengerId
test_df <- test_df %>% 
  select(PassengerId:Cabin, Deck, CabinNumber, CabinSide, Destination:VRDeck)
train_df <- train_df %>% 
  select(PassengerId:Cabin, Deck, CabinNumber, CabinSide, Destination:VRDeck)

Mutate a total debt column

This opertion totals the spending of each passenger; perhaps big spenders were spared teleportation.

# First replace NAs with 0s
train_df <- train_df %>% 
  mutate(VRDeck = replace_na(VRDeck, 0))
test_df <- test_df %>% 
  mutate(VRDeck = replace_na(VRDeck, 0))
# Try to mutate a total column
train_df <- train_df %>% 
  group_by(PassengerId) %>% 
  mutate(Total_Debt = rowSums(across(RoomService:VRDeck)))
test_df <- test_df %>% 
  group_by(PassengerId) %>% 
  mutate(Total_Debt = rowSums(across(RoomService:VRDeck)))

Exloring Correlations in Data

ggplot(train_df, aes(HomePlanet, fill = Transported))+
  geom_bar(position = "dodge", stat = "count")+ 
stat_count(geom = "label", colour = "white", size = 3.5,
aes(label = ..count..),position = position_dodge(width = .9), show.legend = FALSE)+
  labs(title = "Teleported Status by Home Planet",
       x = "Home Planet",
       y = "Totals")
Majority of Europans and Martians were teleported

Majority of Europans and Martians were teleported

ggplot(train_df, aes(Deck, fill = Transported))+
  geom_bar(position = "dodge", stat = "count")+ 
stat_count(geom = "label", colour = "white", size = 2.5,
aes(label = ..count..),position = position_dodge(width = .9), show.legend = FALSE)+
  labs(title = "Teleported Status by Deck Assignment",
       x = "Deck Assignment",
       y = "Totals")
Decks B, C, and G had a majority get teleported

Decks B, C, and G had a majority get teleported

ggplot(train_df, aes(CabinSide, fill = Transported))+
  geom_bar(position = "dodge", stat = "count")+ 
stat_count(geom = "label", colour = "white", size = 3.5,
aes(label = ..count..),position = position_dodge(width = .9), show.legend = FALSE)+
  labs(title = "Teleported Status by Side assignment",
       x = "Side Assignment",
       y = "Totals")
Majority assigned to Starboard side were teleported

Majority assigned to Starboard side were teleported

ggplot(train_df, aes(CryoSleep, fill = Transported))+
  geom_bar(position = "dodge", stat = "count")+ 
stat_count(geom = "label", colour = "white", size = 3.5,
aes(label = ..count..),position = position_dodge(width = .9), show.legend = FALSE)+
  labs(title = "Teleported Status by Cryosleep Status",
       x = "Cryo Sleep status",
       y = "Totals")
Majority of those in Cryosleep were teleported

Majority of those in Cryosleep were teleported

This operation will bin ages into groups to make the age variable more manageable

# make a column that bins ages in increments of 10 (0-9, 10-19, 20-29, 30-39, 40-49, 50-59, 60-69, 70-79)
train_df$AgeBins <- cut(train_df$Age, breaks = c(0, 9, 19, 29, 39, 49, 59, 69, 79),
                        labels = c("0-9", "10-19", "20-29", "30-39", "40-49", "50-59", "60-69", "70-79"))
test_df$AgeBins <- cut(test_df$Age, breaks = c(0, 9, 19, 29, 39, 49, 59, 69, 79),
                        labels = c("0-9", "10-19", "20-29", "30-39", "40-49", "50-59", "60-69", "70-79"))
# Then put them back in the order that is easier to read
train_df <- train_df %>% 
  select(PassengerId:Age, AgeBins, VIP:Total_Debt)
test_df <- test_df %>% 
  select(PassengerId:Age, AgeBins, VIP:Total_Debt)
ggplot(train_df, aes(AgeBins, fill = Transported))+
  geom_bar(position = "dodge", stat = "count")+ 
stat_count(geom = "label", colour = "white", size = 3.5,
aes(label = ..count..),position = position_dodge(width = .9), show.legend = FALSE)+
  labs(title = "Teleported Status by Age Range",
       x = "Age Range",
       y = "Totals")
0-19 and 40-49 year olds were most likely to be teleported

0-19 and 40-49 year olds were most likely to be teleported

ggplot(train_df, aes(Destination, fill = Transported))+
  geom_bar(position = "dodge", stat = "count")+ 
stat_count(geom = "label", colour = "white", size = 3.5,
aes(label = ..count..),position = position_dodge(width = .9), show.legend = FALSE)+
  labs(title = "Teleported Status by Destination",
       x = "Destination",
       y = "Totals")
Majority bound for 55 Cancri e were Teleported

Majority bound for 55 Cancri e were Teleported

ggplot(train_df, aes(VIP, fill = Transported))+
  geom_bar(position = "dodge", stat = "count")+ 
stat_count(geom = "label", color = "white", size = 3.5,
aes(label = ..count..), position = position_dodge(width = 0.9), show.legend = FALSE)+
  labs(title = "Teleported Status by VIP Status",
       x = "VIP Status",
       y = "Totals")
Slight majority of NON-VIP passengers teleported, majority VIP not

Slight majority of NON-VIP passengers teleported, majority VIP not

# some weird scaling issues with this chart
train_df %>% 
  group_by(Transported) %>% 
  summarise(Average_Spent = mean(Total_Debt)) %>% 
  mutate(Average_Spent = dollar(Average_Spent, accuracy = 1e-02)) %>% 
  ggplot(aes(Transported, Average_Spent, fill = Transported))+
  geom_col()+ 
  geom_label(aes(label = Average_Spent), color = "black", fill = "white", show.legend = FALSE, nudge_y = -0.1)+
  labs(title = "Teleportation Status by Average Spent",
       x = "Teleportation Status",
       y = "Average Spent")
Those who were teleported spent on average $1100 less than those who were not

Those who were teleported spent on average $1100 less than those who were not

Creating Probability Vectors in the Data

Odds of being teleported by CryoSleep

this operation creates a dataframe of multiple vectors specific to CryoSleep

Cryo_Odds <- train_df %>% 
  group_by(CryoSleep) %>% 
  mutate(Cryo_Total = n(),
         Cryo_Transported = sum(Transported == TRUE),
         Cryo_Remained = sum(Transported == FALSE)) %>% 
  ungroup() %>% 
  select(PassengerId, CryoSleep, Transported, Cryo_Total, Cryo_Transported, Cryo_Remained) %>% 
  group_by(CryoSleep) %>% 
  mutate(Cryo_odds = Cryo_Transported / Cryo_Total)
Cryo_Odds %>% 
  head(n = 10) %>% 
  kable()
PassengerId CryoSleep Transported Cryo_Total Cryo_Transported Cryo_Remained Cryo_odds
0001_01 FALSE FALSE 5439 1789 3650 0.3289208
0002_01 FALSE TRUE 5439 1789 3650 0.3289208
0003_01 FALSE FALSE 5439 1789 3650 0.3289208
0003_02 FALSE FALSE 5439 1789 3650 0.3289208
0004_01 FALSE TRUE 5439 1789 3650 0.3289208
0005_01 FALSE TRUE 5439 1789 3650 0.3289208
0006_01 FALSE TRUE 5439 1789 3650 0.3289208
0006_02 TRUE TRUE 3037 2483 554 0.8175831
0007_01 FALSE TRUE 5439 1789 3650 0.3289208
0008_01 TRUE TRUE 3037 2483 554 0.8175831
# do the same to add the new column to the original
train_df <- train_df %>% 
  group_by(CryoSleep) %>% 
  mutate(Cryo_Total = n(),
         Cryo_Transported = sum(Transported == TRUE),
         Cryo_Remained = sum(Transported == FALSE)) %>% 
  mutate(Cryo_odds = Cryo_Transported / Cryo_Total)

This operation simplifies the cryosleep data

Cryo_Odds_Simplified <- Cryo_Odds %>% 
  ungroup() %>% 
  select(CryoSleep, Cryo_odds) %>% 
  group_by(CryoSleep) %>% 
  summarise(Cryo_odds = mean(Cryo_odds, trim = 3)) %>% 
  mutate(across(Cryo_odds, round, 3))
kable(Cryo_Odds_Simplified, 
      col.names = c("CryoSleep", "Odds"),
      align = "cc")
CryoSleep Odds
FALSE 0.329
TRUE 0.818
NA 0.488

Visualizing CryoSleep data

ggplot(Cryo_Odds_Simplified, aes(CryoSleep, Cryo_odds))+ 
  geom_col(aes(fill = CryoSleep, show.legend = FALSE)) +
  geom_label(aes(label = Cryo_odds, vjust = 1.5))+
  labs(title = "Odds of Teleportation by Cryosleep Status",
       x = "Cryosleep Status",
       y = "Odds of Teleportation")+
  theme(axis.text.y = element_text(size = 1.5))
Those in cryosleep are more likely to be teleported

Those in cryosleep are more likely to be teleported

Odds of being teleported by HomePlanet

Planet_Odds <- train_df %>% 
  group_by(HomePlanet) %>% 
  mutate(Planet_Total = n(),
         Planet_Transported = sum(Transported == TRUE),
         Planet_Remained = sum(Transported == FALSE)) %>% 
  ungroup() %>% 
  select(PassengerId, HomePlanet, Transported, Planet_Total, Planet_Transported, Planet_Remained) %>% 
  group_by(HomePlanet) %>% 
  mutate(Planet_odds = Planet_Transported / Planet_Total)
Planet_Odds %>% 
  slice_sample(n = 3) %>% 
  kable()
PassengerId HomePlanet Transported Planet_Total Planet_Transported Planet_Remained Planet_odds
4661_02 Earth FALSE 4602 1951 2651 0.4239461
6225_01 Earth FALSE 4602 1951 2651 0.4239461
8319_01 Earth FALSE 4602 1951 2651 0.4239461
7117_01 Europa TRUE 2131 1404 727 0.6588456
4704_01 Europa FALSE 2131 1404 727 0.6588456
9085_03 Europa TRUE 2131 1404 727 0.6588456
5772_01 Mars FALSE 1759 920 839 0.5230244
6253_01 Mars FALSE 1759 920 839 0.5230244
4082_01 Mars FALSE 1759 920 839 0.5230244
4078_01 NA FALSE 201 103 98 0.5124378
6623_01 NA TRUE 201 103 98 0.5124378
6954_02 NA FALSE 201 103 98 0.5124378

This operation simplifies the new data

Planet_Odds_Simplified <- Planet_Odds %>% 
  ungroup() %>% 
  select(HomePlanet, Planet_odds) %>% 
  group_by(HomePlanet) %>% 
  summarise(Planet_odds = mean(Planet_odds, trim = 3)) %>% 
  mutate(across(Planet_odds, round, 3))
kable(Planet_Odds_Simplified, 
      col.names = c("Home Planet", "Odds"),
      align = "cc")
Home Planet Odds
Earth 0.424
Europa 0.659
Mars 0.523
NA 0.512
# do the same to add the new column to the original
train_df <- train_df %>% 
  group_by(HomePlanet) %>% 
  mutate(Planet_Total = n(),
         Planet_Transported = sum(Transported == TRUE),
         Planet_Remained = sum(Transported == FALSE)) %>% 
  mutate(Planet_odds = Planet_Transported / Planet_Total)
ggplot(Planet_Odds_Simplified, aes(HomePlanet, Planet_odds))+ 
  geom_col(aes(fill = HomePlanet, show.legend = FALSE)) +
  geom_label(aes(label = Planet_odds, vjust = 1.5))+
  labs(title = "Odds of Teleportation by Home Planet",
       x = "Home Planet",
       y = "Odds of Teleportation")+
  theme(axis.text.y = element_text(size = 1.5))
Europans are more likely to be teleported

Europans are more likely to be teleported

Odds of Being teleported by Deck assignment

Deck_Odds <- train_df %>% 
  group_by(Deck) %>% 
  mutate(Deck_Total = n(),
         Deck_Transported = sum(Transported == TRUE),
         Deck_Remained = sum(Transported == FALSE)) %>% 
  ungroup() %>% 
  select(PassengerId, Deck, Deck_Total, Transported, Deck_Transported, Deck_Remained) %>% 
  group_by(Deck) %>% 
  mutate(Deck_odds = Deck_Transported / Deck_Total)
Deck_Odds %>% 
  slice_sample(n = 1) %>% 
  kable(align = "ccccccc")
PassengerId Deck Deck_Total Transported Deck_Transported Deck_Remained Deck_odds
6827_02 A 256 TRUE 127 129 0.4960938
6590_01 B 779 FALSE 572 207 0.7342747
1452_04 C 747 FALSE 508 239 0.6800535
1357_01 D 478 TRUE 207 271 0.4330544
1316_02 E 876 FALSE 313 563 0.3573059
4145_01 F 2794 FALSE 1229 1565 0.4398712
1615_03 G 2559 TRUE 1321 1238 0.5162173
1071_01 T 5 FALSE 1 4 0.2000000
0310_01 NA 199 FALSE 100 99 0.5025126
# do the same to add the new column to the original
train_df <- train_df %>%
  group_by(Deck) %>% 
  mutate(Deck_Total = n(),
         Deck_Transported = sum(Transported == TRUE),
         Deck_Remained = sum(Transported == FALSE)) %>% 
  mutate(Deck_odds = Deck_Transported / Deck_Total)

This operation simplifies the data

Deck_Odds_Simplified <- Deck_Odds %>% 
  ungroup() %>% 
  select(Deck, Deck_odds) %>% 
  group_by(Deck) %>% 
  summarise(Deck_odds = mean(Deck_odds, trim = 3)) %>% 
  mutate(across(Deck_odds, round, 3))
kable(Deck_Odds_Simplified,
      align = "cc")
Deck Deck_odds
A 0.496
B 0.734
C 0.680
D 0.433
E 0.357
F 0.440
G 0.516
T 0.200
NA 0.503
ggplot(Deck_Odds_Simplified, aes(Deck, Deck_odds))+ 
  geom_col(aes(fill = Deck, show.legend = FALSE)) +
  geom_label(aes(label = Deck_odds, vjust = 1.5))+
  labs(title = "Odds of Teleportation by Deck Assignment",
       x = "Deck Assignment",
       y = "Odds of Teleportation")+
  theme(axis.text.y = element_text(size = 1.5))
Those assigned to decks B, C, and G are most likely to be teleported

Those assigned to decks B, C, and G are most likely to be teleported

Odds of being transported by vessel Cabin side assignment

CabinSide_Odds <- train_df %>% 
  group_by(CabinSide) %>% 
  mutate(CabinSide_Total = n(),
         CabinSide_Transported = sum(Transported == TRUE),
         CabinSide_Remained = sum(Transported == FALSE)) %>% 
  ungroup() %>% 
  select(PassengerId, CabinSide, Transported, CabinSide_Total, CabinSide_Transported, CabinSide_Remained) %>% 
  group_by(CabinSide) %>% 
  mutate(CabinSide_odds = CabinSide_Transported / CabinSide_Total)
CabinSide_Odds %>% 
  slice_sample(n = 3) %>% 
  kable(col.names = c("PassengerID", "CabinSide", "Transported", "Total", "Teleported", "Remained", "Odds"),
        align = "ccccccc")
PassengerID CabinSide Transported Total Teleported Remained Odds
4336_03 TRUE 199 100 99 0.5025126
8770_03 FALSE 199 100 99 0.5025126
1718_01 TRUE 199 100 99 0.5025126
7679_01 P FALSE 4206 1898 2308 0.4512601
9179_02 P FALSE 4206 1898 2308 0.4512601
6976_01 P TRUE 4206 1898 2308 0.4512601
0728_01 S TRUE 4288 2380 1908 0.5550373
3542_01 S TRUE 4288 2380 1908 0.5550373
0762_01 S FALSE 4288 2380 1908 0.5550373
train_df <- train_df %>% 
  group_by(CabinSide) %>% 
  mutate(CabinSide_Total = n(),
         CabinSide_Transported = sum(Transported == TRUE),
         CabinSide_Remained = sum(Transported == FALSE)) %>% 
  mutate(CabinSide_odds = CabinSide_Transported / CabinSide_Total)
CabinSide_Odds_Simplified <- CabinSide_Odds %>% 
  ungroup() %>% 
  select(CabinSide, CabinSide_odds) %>% 
  group_by(CabinSide) %>% 
  summarise(CabinSide_odds = mean(CabinSide_odds, trim = 3)) %>% 
  mutate(across(CabinSide_odds, round, 3))
CabinSide_Odds_Simplified[1,1] <- NA
kable(CabinSide_Odds_Simplified,
      col.names = c("CabinSide", "Odds"),
      align = "cc")
CabinSide Odds
NA 0.503
P 0.451
S 0.555
ggplot(CabinSide_Odds_Simplified, aes(CabinSide, CabinSide_odds))+ 
  geom_col(aes(fill = CabinSide, show.legend = FALSE)) +
  geom_label(aes(label = CabinSide_odds, vjust = 1.5))+
  labs(title = "Odds of Teleportation by Ship Side Assignment",
       x = "Ship Side Assignment",
       y = "Odds of Teleportation")+
  theme(axis.text.y = element_text(size = 1.5))
Passengers assigned to the starboard side are more likely to be teleported

Passengers assigned to the starboard side are more likely to be teleported

Odds of being teleported by VIP status

VIP_Odds <- train_df %>% 
  group_by(VIP) %>% 
  mutate(VIP_Total = n(),
         VIP_Transported = sum(Transported == TRUE),
         VIP_Remained = sum(Transported == FALSE)) %>% 
  ungroup() %>% 
  select(PassengerId, VIP, Transported, VIP_Total, VIP_Transported, VIP_Remained) %>% 
  group_by(VIP) %>% 
  mutate(VIP_odds = VIP_Transported / VIP_Total)
VIP_Odds %>% 
  slice_sample(n = 3) %>% 
  kable(align = "ccccccc",
        col.names = c("PassengerID", "VIP", "Transported", "Total", "Teleported", "Remained", "Odds"))
PassengerID VIP Transported Total Teleported Remained Odds
6599_01 FALSE TRUE 8291 4198 4093 0.5063322
2683_01 FALSE TRUE 8291 4198 4093 0.5063322
8570_01 FALSE FALSE 8291 4198 4093 0.5063322
1816_01 TRUE TRUE 199 76 123 0.3819095
1742_01 TRUE FALSE 199 76 123 0.3819095
4543_01 TRUE FALSE 199 76 123 0.3819095
8948_01 NA FALSE 203 104 99 0.5123153
1509_01 NA FALSE 203 104 99 0.5123153
8710_01 NA FALSE 203 104 99 0.5123153
train_df <- train_df %>% 
  group_by(VIP) %>% 
  mutate(VIP_Total = n(),
         VIP_Transported = sum(Transported == TRUE),
         VIP_Remained = sum(Transported == FALSE)) %>% 
  mutate(VIP_odds = VIP_Transported / VIP_Total)
VIP_Odds_Simplified <- VIP_Odds %>% 
  ungroup() %>% 
  select(VIP, VIP_odds) %>% 
  group_by(VIP) %>% 
  summarise(VIP_odds = mean(VIP_odds, trim = 3)) %>% 
  mutate(across(VIP_odds, round, 3))
kable(VIP_Odds_Simplified,
      align = "cc")
VIP VIP_odds
FALSE 0.506
TRUE 0.382
NA 0.512
ggplot(VIP_Odds_Simplified, aes(VIP, VIP_odds))+ 
  geom_col(aes(fill = VIP, show.legend = FALSE)) +
  geom_label(aes(label = VIP_odds, vjust = 1.5))+
  labs(title = "Odds of Teleportation by VIP Status",
       x = "VIP Status",
       y = "Odds of Teleportation")+
  theme(axis.text.y = element_text(size = 1.5))
The peasants are more likely to be teleported, obviously.

The peasants are more likely to be teleported, obviously.

Odds of being teleported by AgeBins

AgeBins_Odds <- train_df %>% 
  group_by(AgeBins) %>% 
  mutate(AgeBins_Total = n(),
        AgeBins_Transported = sum(Transported == TRUE),
         AgeBins_Remained = sum(Transported == FALSE)) %>% 
  ungroup() %>% 
  select(PassengerId, Age, AgeBins, Transported, AgeBins_Total, AgeBins_Transported, AgeBins_Remained) %>% 
  group_by(AgeBins) %>% 
  mutate(AgeBins_odds = AgeBins_Transported / AgeBins_Total)
AgeBins_Odds %>% 
  slice_sample(n = 1) %>% 
  kable(align = "cccccccc",
        col.names = c("PassengerID","Age", "AgeBin", "Transported", "Total", "Teleported", "Remained", "Odds"))
PassengerID Age AgeBin Transported Total Teleported Remained Odds
8226_01 3 0-9 TRUE 501 339 162 0.6766467
0598_01 19 10-19 FALSE 1479 788 691 0.5327924
5758_01 22 20-29 TRUE 2762 1296 1466 0.4692252
4017_02 30 30-39 TRUE 1735 796 939 0.4587896
8032_02 40 40-49 TRUE 1048 531 517 0.5066794
3382_02 56 50-59 TRUE 557 275 282 0.4937163
1238_02 61 60-69 TRUE 208 99 109 0.4759615
3831_01 73 70-79 FALSE 46 20 26 0.4347826
5714_03 0 NA TRUE 357 234 123 0.6554622
train_df <- train_df %>% 
  group_by(VIP) %>% 
  mutate(AgeBins_Total = n(),
         AgeBins_Transported = sum(Transported == TRUE),
         AgeBins_Remained = sum(Transported == FALSE)) %>% 
  mutate(AgeBins_odds = AgeBins_Transported / AgeBins_Total)
AgeBins_Odds_Simplified <- AgeBins_Odds %>% 
  ungroup() %>% 
  select(AgeBins, AgeBins_odds) %>% 
  group_by(AgeBins) %>% 
  summarise(AgeBins_odds = mean(AgeBins_odds, trim = 3)) %>% 
  mutate(across(AgeBins_odds, round, 3))
kable(AgeBins_Odds_Simplified,
      align = "cc",
      col.names = c("AgeBins", "Odds"))
AgeBins Odds
0-9 0.677
10-19 0.533
20-29 0.469
30-39 0.459
40-49 0.507
50-59 0.494
60-69 0.476
70-79 0.435
NA 0.655
ggplot(AgeBins_Odds_Simplified, aes(AgeBins, AgeBins_odds))+ 
  geom_col(aes(fill = AgeBins, show.legend = FALSE)) +
  geom_label(aes(label = AgeBins_odds, vjust = 1.5))+
  labs(title = "Odds of Teleportation by Age",
       x = "Age",
       y = "Odds of Teleportation")+
  theme(axis.text.y = element_text(size = 1.5))
Children, teenagers, and the adults who act like them are more likely to be teleported.

Children, teenagers, and the adults who act like them are more likely to be teleported.

Odds of being teleported by Destination

Destination_Odds <- train_df %>% 
  group_by(Destination) %>% 
  mutate(Destination_Total = n(),
        Destination_Transported = sum(Transported == TRUE),
         Destination_Remained = sum(Transported == FALSE)) %>% 
  ungroup() %>% 
  select(PassengerId, Destination, Transported, Destination_Total, Destination_Transported, Destination_Remained) %>% 
  group_by(Destination) %>% 
  mutate(Destination_odds = Destination_Transported / Destination_Total)
Destination_Odds %>% 
  slice_sample(n = 3) %>% 
  kable(align = "ccccccc",
        col.names = c("PassengerID", "Destination", "Transported", "Total", "Teleported", "Remained", "Odds"))
PassengerID Destination Transported Total Teleported Remained Odds
6744_03 55 Cancri e FALSE 1800 1098 702 0.6100000
8004_01 55 Cancri e TRUE 1800 1098 702 0.6100000
0221_01 55 Cancri e FALSE 1800 1098 702 0.6100000
0781_01 PSO J318.5-22 FALSE 796 401 395 0.5037688
6120_01 PSO J318.5-22 TRUE 796 401 395 0.5037688
5108_01 PSO J318.5-22 TRUE 796 401 395 0.5037688
7020_01 TRAPPIST-1e TRUE 5915 2787 3128 0.4711750
2558_01 TRAPPIST-1e TRUE 5915 2787 3128 0.4711750
3103_01 TRAPPIST-1e FALSE 5915 2787 3128 0.4711750
6889_01 NA FALSE 182 92 90 0.5054945
0504_03 NA TRUE 182 92 90 0.5054945
5859_01 NA TRUE 182 92 90 0.5054945
# do the same to add the new column to the original
train_df <- train_df %>% 
  group_by(Destination) %>% 
  mutate(Destination_Total = n(),
         Destination_Transported = sum(Transported == TRUE),
         Destination_Remained = sum(Transported == FALSE)) %>% 
  mutate(Destination_odds = Destination_Transported / Destination_Total)
Destination_Odds_Simplified <- Destination_Odds %>% 
  ungroup() %>% 
  select(Destination, Destination_odds) %>% 
  group_by(Destination) %>% 
  summarise(Destination_odds = mean(Destination_odds, trim = 3)) %>% 
  mutate(across(Destination_odds, round, 3))
kable(Destination_Odds_Simplified,
      align = "cc")
Destination Destination_odds
55 Cancri e 0.610
PSO J318.5-22 0.504
TRAPPIST-1e 0.471
NA 0.505
ggplot(Destination_Odds_Simplified, aes(Destination, Destination_odds))+ 
  geom_col(aes(fill = Destination, show.legend = FALSE)) +
  geom_label(aes(label = Destination_odds, vjust = 1.5))+
  labs(title = "Odds of Teleportation by Destination",
       x = "Destination",
       y = "Odds of Teleportation")+
  theme(axis.text.y = element_text(size = 1.5))
Passengers bound for 55 Cancri e  and PSO J318.5-22 are more likely to be teleported

Passengers bound for 55 Cancri e and PSO J318.5-22 are more likely to be teleported

Odds of being teleported by Total_Debt

This operation bins the total debt figures into manageable chunks

train_df$DebtBins <- cut(train_df$Total_Debt, breaks = c(-1, 1000, 5000, 10000, 20000, 30000, 100000), labels = c("0-1000", "1-5000", "5-10000", "5-20000", "20-30000", "30000+"))

test_df$DebtBins <- cut(test_df$Total_Debt, breaks = c(-1, 1000, 5000, 10000, 20000, 30000, 100000),labels = c("0-1000", "1-5000", "5-10000", "5-20000", "20-30000", "30000+"))

this operation creates a df just for binned debt totals and their odds

DebtBins_Odds <- train_df %>% 
  group_by(DebtBins) %>% 
  mutate(DebtBins_Total = n(),
        DebtBins_Transported = sum(Transported == TRUE),
         DebtBins_Remained = sum(Transported == FALSE)) %>% 
  ungroup() %>% 
  select(PassengerId, DebtBins, Transported, DebtBins_Total, DebtBins_Transported, DebtBins_Remained) %>% 
  group_by(DebtBins) %>% 
  mutate(DebtBins_odds = DebtBins_Transported / DebtBins_Total)
DebtBins_Odds %>% 
  slice_sample(n = 1) %>% 
  kable(align = "ccccccc",
        col.names = c("PassengerID", "DebtBin", "Transported", "Total", "Teleported", "Remained", "Odds"))
PassengerID DebtBin Transported Total Teleported Remained Odds
4933_04 0-1000 TRUE 5860 3520 2340 0.6006826
1561_01 1-5000 FALSE 2196 653 1543 0.2973588
7833_01 5-10000 FALSE 456 150 306 0.3289474
2589_03 5-20000 FALSE 150 48 102 0.3200000
8970_01 20-30000 FALSE 27 5 22 0.1851852
8556_01 30000+ FALSE 4 2 2 0.5000000
train_df <- train_df %>% 
  group_by(DebtBins) %>% 
  mutate(DebtBins_Total = n(),
         DebtBins_Transported = sum(Transported == TRUE),
         DebtBins_Remained = sum(Transported == FALSE)) %>% 
  mutate(DebtBins_odds = DebtBins_Transported / DebtBins_Total)
DebtBins_Odds_Simplified <- DebtBins_Odds %>% 
  ungroup() %>% 
  select(DebtBins, DebtBins_odds) %>% 
  group_by(DebtBins) %>% 
  summarise(DebtBins_odds = mean(DebtBins_odds, trim = 3)) %>% 
  mutate(across(DebtBins_odds, round, 3))
kable(DebtBins_Odds_Simplified,
      align = "cc",
      col.names = c("DebBin", "Odds"))
DebBin Odds
0-1000 0.601
1-5000 0.297
5-10000 0.329
5-20000 0.320
20-30000 0.185
30000+ 0.500
ggplot(DebtBins_Odds_Simplified, aes(DebtBins, DebtBins_odds))+ 
  geom_col(aes(fill = DebtBins, show.legend = FALSE)) +
  geom_label(aes(label = DebtBins_odds, vjust = 1.5))+
  labs(title = "Odds of Teleportation by Debt Amount",
       x = "Debt Amount",
       y = "Odds of Teleportation")+
  theme(axis.text.y = element_text(size = 1.5))
Those who spent the least and those who spent the most were more likely to be teleported

Those who spent the least and those who spent the most were more likely to be teleported

this operation reorganzied the data for consumption

train_df <- train_df %>% 
  select(PassengerId:HomePlanet, Planet_Total:Planet_odds, CryoSleep:Cryo_odds, Cabin:CabinSide, Deck_Total:CabinSide_odds, Destination, Destination_Total:Destination_odds, Age:AgeBins, AgeBins_Total:AgeBins_odds, VIP, VIP_Total:VIP_odds, RoomService:Total_Debt, DebtBins:DebtBins_odds)

Determine Original Odds of Teleportation for Teleported Passengers

This operation gives a snapshot of the range of odds

train_total_odds <- train_df %>%
  group_by(PassengerId) %>% 
  mutate(Odds = (Planet_odds * Cryo_odds * Deck_odds *CabinSide_odds * Destination_odds * AgeBins_odds * VIP_odds *DebtBins_odds)) %>%
  select(PassengerId, Odds, Transported, GroupNum:Name, HomePlanet:DebtBins_odds)
summary(train_total_odds$Odds)
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## 0.0005464 0.0020066 0.0028990 0.0047252 0.0062640 0.0211128

To actually be Teleported, a passenger has to have at least a 0.00069 chance to be teleported at all

Train_Mean_Min_Max_Odds <- train_total_odds %>% 
  group_by(Transported) %>% 
  summarise(MeanOdds = mean(Odds),
            MaxOdds = max(Odds),
            MinOdds = min(Odds),
            SDOdds = sd(Odds))
kable(Train_Mean_Min_Max_Odds,
      align = "ccccc")
Transported MeanOdds MaxOdds MinOdds SDOdds
FALSE 0.0027124 0.0139331 0.0005464 0.0017756
TRUE 0.0067090 0.0211128 0.0006870 0.0049713

Create Odds Functions to Apply to Existing Passengers

AgeBins Function this function will assign odds based on a passenger’s age range

age_bins_function <- function(x) {
  case_when(x == "0-9" ~ 0.677,
            x == "10-19" ~ 0.533,
            x == "20-29" ~ 0.469,
            x == "30-39" ~ 0.459,
            x == "40-49" ~ 0.507,
            x == "50-59" ~ 0.494,
            x == "60-69" ~ 0.476,
            x == "70-79" ~ 0.435,
            x == "NA" ~ 0.655)
}

this operation adds the vector created by the function into the test dataframe

test_age_bins <- as.vector(lapply(test_df$AgeBins, age_bins_function))
# add it as a column to test_df
test_df$AgeBinsOdds <- test_age_bins

CabinSide Function

CabinSide_function <- function(x) {
  case_when(x == NA ~ 0.503,
            x == "P" ~ 0.451,
            x == "S" ~ 0.555)
}
test_CabinSide <- as.vector(lapply(test_df$CabinSide, CabinSide_function))
# add it as a column to test_df
test_df$CabinSideOdds <- test_CabinSide
# for some reason, the newly created columns have to be assigned as doubles after being created
test_df <- test_df %>% 
  mutate(CabinSideOdds = as.double(CabinSideOdds),
         AgeBinsOdds = as.double(AgeBinsOdds))

CryoSleep Function

CryoSleep_function <- function(x) {
  case_when(x == "FALSE" ~ 0.329,
            x == "TRUE" ~ 0.451,
            x == NA ~ 0.488)
}
test_CryoSleep <- as.vector(lapply(test_df$CryoSleep, CryoSleep_function))
# add it as a column to test_df
test_df$CryoSleepOdds <- test_CryoSleep
# For some reason all these columns must be transformed back into doubles
test_df <- test_df %>% 
  mutate(CabinSideOdds = as.double(CabinSideOdds),
         AgeBinsOdds = as.double(AgeBinsOdds),
         CryoSleepOdds = as.double(CryoSleepOdds))

DebtBins Function

DebtBins_function <- function(x) {
  case_when(x == "0-1000" ~ 0.601,
            x == "1-5000" ~ 0.297,
            x == "5-10000" ~ 0.329,
            x == "5-2000" ~ 0.320,
            x == "20-30000" ~ 0.185,
            x == "30000+" ~ 0.500)
}
test_DebtBins <- as.vector(lapply(test_df$DebtBins, DebtBins_function))
# add it as a column to test_df
test_df$DebtBinsOdds <- test_DebtBins
# For some reason all these columns must be transformed back into doubles
test_df <- test_df %>% 
  mutate(CabinSideOdds = as.double(CabinSideOdds),
         AgeBinsOdds = as.double(AgeBinsOdds),
         CryoSleepOdds = as.double(CryoSleepOdds),
         DebtBinsOdds = as.double(DebtBinsOdds))

Destination Function

Destination_function <- function(x) {
  case_when(x == "55 Cancri e" ~ 0.610,
            x == "PSO J318.5-22" ~ 0.504,
            x == "TRAPPIST-1e" ~ 0.471,
            x == NA ~ 0.505)
}
test_Destination <- as.vector(lapply(test_df$Destination, Destination_function))
# add it as a column to test_df
test_df$DestinationOdds <- test_Destination
# For some reason all these columns must be transformed back into doubles
test_df <- test_df %>% 
  mutate(CabinSideOdds = as.double(CabinSideOdds),
         AgeBinsOdds = as.double(AgeBinsOdds),
         CryoSleepOdds = as.double(CryoSleepOdds),
         DebtBinsOdds = as.double(DebtBinsOdds),
         DestinationOdds = as.double(DestinationOdds))

HomePlanet Function

HomePlanet_function <- function(x) {
  case_when(x == "Earth" ~ 0.424,
            x == "Europa" ~ 0.659,
            x == "Mars" ~ 0.523,
            x == NA ~ 0.512)
}
test_HomePlanet <- as.vector(lapply(test_df$HomePlanet, HomePlanet_function))
# add it as a column to test_df
test_df$HomePlanetOdds <- test_HomePlanet
# For some reason all these columns must be transformed back into doubles
test_df <- test_df %>% 
  mutate(CabinSideOdds = as.double(CabinSideOdds),
         AgeBinsOdds = as.double(AgeBinsOdds),
         CryoSleepOdds = as.double(CryoSleepOdds),
         DebtBinsOdds = as.double(DebtBinsOdds),
         DestinationOdds = as.double(DestinationOdds),
         HomePlanetOdds = as.double(HomePlanetOdds))

Deck Function

Deck_function <- function(x) {
  case_when(x == "A" ~ 0.496,
            x == "B" ~ 0.734,
            x == "C" ~ 0.680,
            x == "D" ~ 0.433,
            x == "E" ~ 0.357,
            x == "F" ~ 0.440,
            x == "G" ~ 0.516,
            x == "T" ~ 0.200,
            x == NA ~ 0.503)
}
test_Deck <- as.vector(lapply(test_df$Deck, Deck_function))
# add it as a column to test_df
test_df$DeckOdds <- test_Deck
# For some reason all these columns must be transformed back into doubles
test_df <- test_df %>% 
  mutate(CabinSideOdds = as.double(CabinSideOdds),
         AgeBinsOdds = as.double(AgeBinsOdds),
         CryoSleepOdds = as.double(CryoSleepOdds),
         DebtBinsOdds = as.double(DebtBinsOdds),
         DestinationOdds = as.double(DestinationOdds),
         HomePlanetOdds = as.double(HomePlanetOdds),
         DeckOdds = as.double(DeckOdds))

VIP function

VIP_function <- function(x) {
  case_when(x == FALSE ~ 0.506,
            x == TRUE ~ 0.382,
            x == NA ~ 0.512)
}
test_VIP <- as.vector(lapply(test_df$VIP, VIP_function))
# add it as a column to test_df
test_df$VIPOdds <- unlist(test_VIP)
# For some reason all these columns must be transformed back into doubles
test_df <- test_df %>% 
  mutate(CabinSideOdds = as.double(CabinSideOdds),
         AgeBinsOdds = as.double(AgeBinsOdds),
         CryoSleepOdds = as.double(CryoSleepOdds),
         DebtBinsOdds = as.double(DebtBinsOdds),
         DestinationOdds = as.double(DestinationOdds),
         HomePlanetOdds = as.double(HomePlanetOdds),
         DeckOdds = as.double(DeckOdds),
         VIPOdds = as.double(VIPOdds))
# this operation reorganized the data in the test df
test_df <- test_df %>% 
  select(PassengerId:HomePlanet, HomePlanetOdds, CryoSleep, CryoSleepOdds, Cabin, CabinNumber, Deck, DeckOdds, CabinSide, CabinSideOdds, Destination, DestinationOdds, Age:AgeBins, AgeBinsOdds, VIP, VIPOdds, RoomService:DebtBins, DebtBinsOdds)

Determine Odds for Existing Passengers

To be teleported at all, a passenger’s odds must be at least 0.0007

test_total_odds <- test_df %>%
  ungroup() %>% 
  group_by(PassengerId) %>% 
  mutate(Odds = (HomePlanetOdds * CryoSleepOdds * DeckOdds *CabinSideOdds * DestinationOdds * AgeBinsOdds * VIPOdds * DebtBinsOdds)) 
summary(test_total_odds$Odds)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##  0.0007  0.0017  0.0027  0.0032  0.0039  0.0120     659

Determining how many of those between 0.0005 - 0.001 were actually teleported

Using the train_df data to determine what percentage of those with particular odds were actually teleported. This figure will be applied to those from the test df

train_total_odds %>% 
  filter(Odds >= 0.0005 & Odds <= 0.001) %>% 
  ggplot(aes(x = Transported, fill = Transported))+
  geom_bar(stat = "count")+
  stat_count(geom = "label", color = "white", size = 3.5,
aes(label = ..count..), position = position_dodge(width = 0.9), show.legend = FALSE)+
  labs(title = "Total Teleported with Odds between 0.0005 and 0.001",
       x = "Teleportation Status",
       y = "Total Teleported")+
  theme(axis.text.y = element_text(size = 1.5))
Approximatley 21% chance of teleportation

Approximatley 21% chance of teleportation

Determining how many of those between 0.001- 0.005 were actually teleported

train_total_odds %>% 
  filter(Odds > 0.001 & Odds <= 0.005) %>% 
  ggplot(aes(x = Transported, fill = Transported))+
  geom_bar(stat = "count")+
  stat_count(geom = "label", color = "white", size = 3.5,
aes(label = ..count..), position = position_dodge(width = 0.9), show.legend = FALSE)+
  labs(title = "Total Teleported with Odds between 0.001 and 0.005",
       x = "Teleportation Status",
       y = "Total Teleported")+
  theme(axis.text.y = element_text(size = 1.5))
Approximately 34% chance of teleportation

Approximately 34% chance of teleportation

Determining how many of those between 0.005- 0.01 were actually teleported

train_total_odds %>% 
  filter(Odds > 0.005 & Odds <= 0.01) %>% 
  ggplot(aes(x = Transported, fill = Transported))+
  geom_bar(stat = "count")+
  stat_count(geom = "label", color = "white", size = 3.5,
aes(label = ..count..), position = position_dodge(width = 0.9), show.legend = FALSE)+
  labs(title = "Total Teleported with Odds between 0.005 and 0.01",
       x = "Teleportation Status",
       y = "Total Teleported")+
  theme(axis.text.y = element_text(size = 1.5))
Approximately 76% chance of teleportation

Approximately 76% chance of teleportation

Determine how many of those between 0.01- 0.05 were actually teleported

train_total_odds %>% 
  filter(Odds > 0.01 & Odds <= 0.05) %>% 
  ggplot(aes(x = Transported, fill = Transported))+
  geom_bar(stat = "count")+
  stat_count(geom = "label", color = "white", size = 3.5,
aes(label = ..count..), position = position_dodge(width = 0.9), show.legend = FALSE)+
  labs(title = "Total Teleported with Odds between 0.01 and 0.05",
       x = "Teleportation Status",
       y = "Total Teleported")+
  theme(axis.text.y = element_text(size = 1.5))
Approximately 99% chance of teleportation

Approximately 99% chance of teleportation

Conclusions

This operation creates a function that will give a vector of liklihoods based on a passenger’s overall odds of teleportation

likelihood_function <- function(x) {
  case_when((x >- 0.0005 & x <= 0.001) ~ 21,
            (x > 0.001 & x <= 0.005) ~ 34,
            (x > 0.005 & x <= 0.01) ~ 76,
            (x > 0.01 & x <= 0.05) ~ 99)
}
test_Estimates <- as.vector(lapply(test_total_odds$Odds, likelihood_function))
# add it as a column to test_df
test_total_odds$Percent_Likelihood <- test_Estimates
# reorder columns to make checking easier
test_total_odds <- test_total_odds %>% 
  select(PassengerId, Odds, Percent_Likelihood, GroupNum:DebtBinsOdds)
# For some reason all these columns must be transformed back into doubles
test_total_odds <- test_total_odds %>% 
  mutate(Percent_Likelihood = as.double(Percent_Likelihood))

This operation simplifies for consumption

FINAL <- test_total_odds

passenger_likelihood <- FINAL %>% 
  select(PassengerId, Name, Percent_Likelihood)
passenger_likelihood %>% 
  slice_sample(n = 10) %>% 
  head(n = 12) %>% 
  kable(align = "ccc")
PassengerId Name Percent_Likelihood
0013_01 Nelly Carsoning 34
0018_01 Lerome Peckers 34
0019_01 Sabih Unhearfus 76
0021_01 Meratz Caltilter 34
0023_01 Brence Harperez 34
0027_01 Karlen Ricks 21
0029_01 Aldah Ainserfle 76
0032_01 Acrabi Pringry 34
0032_02 Dhena Pringry 76
0033_01 Eliana Delazarson 34
0037_01 Vivia Rickson 34
0040_01 Antino Pinoffent 34

sample visualiztion of simplified data

passenger_likelihood %>% 
  mutate(Percent_Likelihood = as.factor(Percent_Likelihood)) %>% 
ggplot(aes(Percent_Likelihood, fill = Percent_Likelihood))+
  geom_bar()+
  stat_count(geom = "text", color = "light blue", size = 3.5, aes(label = ..count..), position = position_stack(vjust = 0.5))+
  labs(title = "Total Possible Teleported Passengers by Likelihood",
       x = "Percent Likelihood",
       y = "Total Possible Teleported")+
  theme(axis.text.y = element_text(size = 1.5))