Background The Spaceship Titanic was an interstellar passenger liner launched a month ago. With almost 13,000 passengers on board, the vessel set out on its maiden voyage transporting emigrants from our solar system to three newly habitable exoplanets orbiting nearby stars. While rounding Alpha Centauri en route to its first destination—the torrid 55 Cancri E—the unwary Spaceship Titanic collided with a spacetime anomaly hidden within a dust cloud. Sadly, it met a similar fate as its namesake from 1000 years before. Though the ship stayed intact, almost half of the passengers were transported to an alternate dimension!
Objective To help rescue crews and retrieve the lost passengers, you are challenged to predict which passengers were transported by the anomaly using records recovered from the spaceship’s damaged computer system. Help save them and change history!
NAs I mostly ignored them, and did not attempt to discover what information the NA data might disclose upon investigation. That would’ve take too long. ### Packages - tidyverse - stringr - scales - knitr
All datasets obtained from Kaggle
The train_df dataset provides personal records for about 8700 passengers whose teleportation status is known.
The test_df dataset provides personal records for the remaining ~4300 passengers whose teleportation status is unknown. The key task is to project which of these passengers was likely teleported.
test_df %>%
head(n = 10) %>%
kable()
| PassengerId | HomePlanet | CryoSleep | Cabin | Destination | Age | VIP | RoomService | FoodCourt | ShoppingMall | Spa | VRDeck | Name |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0013_01 | Earth | TRUE | G/3/S | TRAPPIST-1e | 27 | FALSE | 0 | 0 | 0 | 0 | 0 | Nelly Carsoning |
| 0018_01 | Earth | FALSE | F/4/S | TRAPPIST-1e | 19 | FALSE | 0 | 9 | 0 | 2823 | 0 | Lerome Peckers |
| 0019_01 | Europa | TRUE | C/0/S | 55 Cancri e | 31 | FALSE | 0 | 0 | 0 | 0 | 0 | Sabih Unhearfus |
| 0021_01 | Europa | FALSE | C/1/S | TRAPPIST-1e | 38 | FALSE | 0 | 6652 | 0 | 181 | 585 | Meratz Caltilter |
| 0023_01 | Earth | FALSE | F/5/S | TRAPPIST-1e | 20 | FALSE | 10 | 0 | 635 | 0 | 0 | Brence Harperez |
| 0027_01 | Earth | FALSE | F/7/P | TRAPPIST-1e | 31 | FALSE | 0 | 1615 | 263 | 113 | 60 | Karlen Ricks |
| 0029_01 | Europa | TRUE | B/2/P | 55 Cancri e | 21 | FALSE | 0 | NA | 0 | 0 | 0 | Aldah Ainserfle |
| 0032_01 | Europa | TRUE | D/0/S | TRAPPIST-1e | 20 | FALSE | 0 | 0 | 0 | 0 | 0 | Acrabi Pringry |
| 0032_02 | Europa | TRUE | D/0/S | 55 Cancri e | 23 | FALSE | 0 | 0 | 0 | 0 | 0 | Dhena Pringry |
| 0033_01 | Earth | FALSE | F/7/S | 55 Cancri e | 24 | FALSE | 0 | 639 | 0 | 0 | 0 | Eliana Delazarson |
PassengerId: A unique Id for each passenger. Each Id takes the form gggg_pp where gggg indicates a group the passenger is travelling with and pp is their number within the group. People in a group are often family members, but not always.
HomePlanet: The planet the passenger departed from, typically their planet of permanent residence.
CryoSleep: Indicates whether the passenger elected to be put into suspended animation for the duration of the voyage. Passengers in cryosleep are confined to their cabins.
Cabin: The cabin number where the passenger is staying. Takes the form deck/num/side, where side can be either P for Port or S for Starboard.
Destination: The planet the passenger will be debarking to.
Age: The age of the passenger.
VIP: Whether the passenger has paid for special VIP service during the voyage.
RoomService, FoodCourt, ShoppingMall, Spa, VRDeck: Amount the passenger has billed at each of the Spaceship Titanic’s many luxury amenities.
Name: The first and last names of the passenger.
Transported: Whether the passenger was transported to another dimension. This is the target, the column you are trying to predict.
PassengerId: Id for each passenger in the test set.
Transported: The target. For each passenger, predict either True or False. (I refer to this as “teleported” throughout this project, also I only predict likelihood)
Addresing the NAs in the categories for amounts billed will help with some of the math later on. We address each dataframes separately to keep them uniform
test_df <- test_df %>%
mutate(RoomService = replace_na(test_df$RoomService, 0),
FoodCourt = replace_na(test_df$FoodCourt, 0),
ShoppingMall = replace_na(test_df$ShoppingMall, 0),
Spa = replace_na(test_df$Spa, 0),
VRDeck = replace_na(test_df$VRDeck))
train_df <- train_df %>%
mutate(RoomService = replace_na(train_df$RoomService, 0),
FoodCourt = replace_na(train_df$FoodCourt, 0),
ShoppingMall = replace_na(train_df$ShoppingMall, 0),
Spa = replace_na(train_df$Spa, 0),
VRDeck = replace_na(train_df$VRDeck))
This operation will allow us to group by group number should we need to later on; just seems like something we’ll need. I ended up forgetting to use this to determine likelihood of teleportation by assigned group number or passenger number. Oops.
# first make a Tibble that just separates the PassengerId into Group# and Passenger#
train_group_numbers <- as.tibble(str_split(string = train_df$PassengerId, pattern ="_", simplify = T))
test_group_numbers <- as.tibble(str_split(string = test_df$PassengerId, pattern ="_", simplify = T))
#change column names
colnames(test_group_numbers)[1] <- "GroupNum"
colnames(train_group_numbers)[1] <- "GroupNum"
colnames(test_group_numbers)[2] <- "PsngrNum"
colnames(train_group_numbers)[2] <- "PsngrNum"
#Add these columns to the original dataframes
test_df <- test_df %>%
mutate(GroupNum = test_group_numbers$GroupNum,
PsngrNum = test_group_numbers$PsngrNum)
train_df <- train_df %>%
mutate(GroupNum = train_group_numbers$GroupNum,
PsngrNum = train_group_numbers$PsngrNum)
# reorder to put the new columns by the PassengerId
test_df <- test_df %>%
select(PassengerId, GroupNum, PsngrNum, Name, HomePlanet:VRDeck)
train_df <- train_df %>%
select(PassengerId, GroupNum, PsngrNum, Name, Transported, HomePlanet:VRDeck)
This operation will parse out deck assignment and ship side assignment from the Cabin vector. Perhaps where the passenger was sleeping (if they were) made a difference in their likelihood of being teleported
# first make a Tibble that just separates the Cabin assignments into Deck/Number/Side
train_cabin_assignments <- as.tibble(str_split(string = train_df$Cabin, pattern = "/", simplify = T))
test_cabin_assignments <- as.tibble(str_split(string = test_df$Cabin, pattern = "/", simplify = T))
#change column names
colnames(test_cabin_assignments)[1] <- "Deck"
colnames(train_cabin_assignments)[1] <- "Deck"
colnames(test_cabin_assignments)[2] <- "CabinNumber"
colnames(train_cabin_assignments)[2] <- "CabinNumber"
colnames(test_cabin_assignments)[3] <- "CabinSide"
colnames(train_cabin_assignments)[3] <- "CabinSide"
#Add these columns to the original dataframes
test_df <- test_df %>%
mutate(Deck = test_cabin_assignments$Deck,
CabinNumber = test_cabin_assignments$CabinNumber,
CabinSide = test_cabin_assignments$CabinSide)
train_df <- train_df %>%
mutate(Deck = train_cabin_assignments$Deck,
CabinNumber = train_cabin_assignments$CabinNumber,
CabinSide = train_cabin_assignments$CabinSide)
# reorder to put the new columns by the PassengerId
test_df <- test_df %>%
select(PassengerId:Cabin, Deck, CabinNumber, CabinSide, Destination:VRDeck)
train_df <- train_df %>%
select(PassengerId:Cabin, Deck, CabinNumber, CabinSide, Destination:VRDeck)
This opertion totals the spending of each passenger; perhaps big spenders were spared teleportation.
# First replace NAs with 0s
train_df <- train_df %>%
mutate(VRDeck = replace_na(VRDeck, 0))
test_df <- test_df %>%
mutate(VRDeck = replace_na(VRDeck, 0))
# Try to mutate a total column
train_df <- train_df %>%
group_by(PassengerId) %>%
mutate(Total_Debt = rowSums(across(RoomService:VRDeck)))
test_df <- test_df %>%
group_by(PassengerId) %>%
mutate(Total_Debt = rowSums(across(RoomService:VRDeck)))
ggplot(train_df, aes(HomePlanet, fill = Transported))+
geom_bar(position = "dodge", stat = "count")+
stat_count(geom = "label", colour = "white", size = 3.5,
aes(label = ..count..),position = position_dodge(width = .9), show.legend = FALSE)+
labs(title = "Teleported Status by Home Planet",
x = "Home Planet",
y = "Totals")
Majority of Europans and Martians were teleported
ggplot(train_df, aes(Deck, fill = Transported))+
geom_bar(position = "dodge", stat = "count")+
stat_count(geom = "label", colour = "white", size = 2.5,
aes(label = ..count..),position = position_dodge(width = .9), show.legend = FALSE)+
labs(title = "Teleported Status by Deck Assignment",
x = "Deck Assignment",
y = "Totals")
Decks B, C, and G had a majority get teleported
ggplot(train_df, aes(CabinSide, fill = Transported))+
geom_bar(position = "dodge", stat = "count")+
stat_count(geom = "label", colour = "white", size = 3.5,
aes(label = ..count..),position = position_dodge(width = .9), show.legend = FALSE)+
labs(title = "Teleported Status by Side assignment",
x = "Side Assignment",
y = "Totals")
Majority assigned to Starboard side were teleported
ggplot(train_df, aes(CryoSleep, fill = Transported))+
geom_bar(position = "dodge", stat = "count")+
stat_count(geom = "label", colour = "white", size = 3.5,
aes(label = ..count..),position = position_dodge(width = .9), show.legend = FALSE)+
labs(title = "Teleported Status by Cryosleep Status",
x = "Cryo Sleep status",
y = "Totals")
Majority of those in Cryosleep were teleported
This operation will bin ages into groups to make the age variable more manageable
# make a column that bins ages in increments of 10 (0-9, 10-19, 20-29, 30-39, 40-49, 50-59, 60-69, 70-79)
train_df$AgeBins <- cut(train_df$Age, breaks = c(0, 9, 19, 29, 39, 49, 59, 69, 79),
labels = c("0-9", "10-19", "20-29", "30-39", "40-49", "50-59", "60-69", "70-79"))
test_df$AgeBins <- cut(test_df$Age, breaks = c(0, 9, 19, 29, 39, 49, 59, 69, 79),
labels = c("0-9", "10-19", "20-29", "30-39", "40-49", "50-59", "60-69", "70-79"))
# Then put them back in the order that is easier to read
train_df <- train_df %>%
select(PassengerId:Age, AgeBins, VIP:Total_Debt)
test_df <- test_df %>%
select(PassengerId:Age, AgeBins, VIP:Total_Debt)
ggplot(train_df, aes(AgeBins, fill = Transported))+
geom_bar(position = "dodge", stat = "count")+
stat_count(geom = "label", colour = "white", size = 3.5,
aes(label = ..count..),position = position_dodge(width = .9), show.legend = FALSE)+
labs(title = "Teleported Status by Age Range",
x = "Age Range",
y = "Totals")
0-19 and 40-49 year olds were most likely to be teleported
ggplot(train_df, aes(Destination, fill = Transported))+
geom_bar(position = "dodge", stat = "count")+
stat_count(geom = "label", colour = "white", size = 3.5,
aes(label = ..count..),position = position_dodge(width = .9), show.legend = FALSE)+
labs(title = "Teleported Status by Destination",
x = "Destination",
y = "Totals")
Majority bound for 55 Cancri e were Teleported
ggplot(train_df, aes(VIP, fill = Transported))+
geom_bar(position = "dodge", stat = "count")+
stat_count(geom = "label", color = "white", size = 3.5,
aes(label = ..count..), position = position_dodge(width = 0.9), show.legend = FALSE)+
labs(title = "Teleported Status by VIP Status",
x = "VIP Status",
y = "Totals")
Slight majority of NON-VIP passengers teleported, majority VIP not
# some weird scaling issues with this chart
train_df %>%
group_by(Transported) %>%
summarise(Average_Spent = mean(Total_Debt)) %>%
mutate(Average_Spent = dollar(Average_Spent, accuracy = 1e-02)) %>%
ggplot(aes(Transported, Average_Spent, fill = Transported))+
geom_col()+
geom_label(aes(label = Average_Spent), color = "black", fill = "white", show.legend = FALSE, nudge_y = -0.1)+
labs(title = "Teleportation Status by Average Spent",
x = "Teleportation Status",
y = "Average Spent")
Those who were teleported spent on average $1100 less than those who were not
this operation creates a dataframe of multiple vectors specific to CryoSleep
Cryo_Odds <- train_df %>%
group_by(CryoSleep) %>%
mutate(Cryo_Total = n(),
Cryo_Transported = sum(Transported == TRUE),
Cryo_Remained = sum(Transported == FALSE)) %>%
ungroup() %>%
select(PassengerId, CryoSleep, Transported, Cryo_Total, Cryo_Transported, Cryo_Remained) %>%
group_by(CryoSleep) %>%
mutate(Cryo_odds = Cryo_Transported / Cryo_Total)
Cryo_Odds %>%
head(n = 10) %>%
kable()
| PassengerId | CryoSleep | Transported | Cryo_Total | Cryo_Transported | Cryo_Remained | Cryo_odds |
|---|---|---|---|---|---|---|
| 0001_01 | FALSE | FALSE | 5439 | 1789 | 3650 | 0.3289208 |
| 0002_01 | FALSE | TRUE | 5439 | 1789 | 3650 | 0.3289208 |
| 0003_01 | FALSE | FALSE | 5439 | 1789 | 3650 | 0.3289208 |
| 0003_02 | FALSE | FALSE | 5439 | 1789 | 3650 | 0.3289208 |
| 0004_01 | FALSE | TRUE | 5439 | 1789 | 3650 | 0.3289208 |
| 0005_01 | FALSE | TRUE | 5439 | 1789 | 3650 | 0.3289208 |
| 0006_01 | FALSE | TRUE | 5439 | 1789 | 3650 | 0.3289208 |
| 0006_02 | TRUE | TRUE | 3037 | 2483 | 554 | 0.8175831 |
| 0007_01 | FALSE | TRUE | 5439 | 1789 | 3650 | 0.3289208 |
| 0008_01 | TRUE | TRUE | 3037 | 2483 | 554 | 0.8175831 |
# do the same to add the new column to the original
train_df <- train_df %>%
group_by(CryoSleep) %>%
mutate(Cryo_Total = n(),
Cryo_Transported = sum(Transported == TRUE),
Cryo_Remained = sum(Transported == FALSE)) %>%
mutate(Cryo_odds = Cryo_Transported / Cryo_Total)
This operation simplifies the cryosleep data
Cryo_Odds_Simplified <- Cryo_Odds %>%
ungroup() %>%
select(CryoSleep, Cryo_odds) %>%
group_by(CryoSleep) %>%
summarise(Cryo_odds = mean(Cryo_odds, trim = 3)) %>%
mutate(across(Cryo_odds, round, 3))
kable(Cryo_Odds_Simplified,
col.names = c("CryoSleep", "Odds"),
align = "cc")
| CryoSleep | Odds |
|---|---|
| FALSE | 0.329 |
| TRUE | 0.818 |
| NA | 0.488 |
Visualizing CryoSleep data
ggplot(Cryo_Odds_Simplified, aes(CryoSleep, Cryo_odds))+
geom_col(aes(fill = CryoSleep, show.legend = FALSE)) +
geom_label(aes(label = Cryo_odds, vjust = 1.5))+
labs(title = "Odds of Teleportation by Cryosleep Status",
x = "Cryosleep Status",
y = "Odds of Teleportation")+
theme(axis.text.y = element_text(size = 1.5))
Those in cryosleep are more likely to be teleported
Planet_Odds <- train_df %>%
group_by(HomePlanet) %>%
mutate(Planet_Total = n(),
Planet_Transported = sum(Transported == TRUE),
Planet_Remained = sum(Transported == FALSE)) %>%
ungroup() %>%
select(PassengerId, HomePlanet, Transported, Planet_Total, Planet_Transported, Planet_Remained) %>%
group_by(HomePlanet) %>%
mutate(Planet_odds = Planet_Transported / Planet_Total)
Planet_Odds %>%
slice_sample(n = 3) %>%
kable()
| PassengerId | HomePlanet | Transported | Planet_Total | Planet_Transported | Planet_Remained | Planet_odds |
|---|---|---|---|---|---|---|
| 4661_02 | Earth | FALSE | 4602 | 1951 | 2651 | 0.4239461 |
| 6225_01 | Earth | FALSE | 4602 | 1951 | 2651 | 0.4239461 |
| 8319_01 | Earth | FALSE | 4602 | 1951 | 2651 | 0.4239461 |
| 7117_01 | Europa | TRUE | 2131 | 1404 | 727 | 0.6588456 |
| 4704_01 | Europa | FALSE | 2131 | 1404 | 727 | 0.6588456 |
| 9085_03 | Europa | TRUE | 2131 | 1404 | 727 | 0.6588456 |
| 5772_01 | Mars | FALSE | 1759 | 920 | 839 | 0.5230244 |
| 6253_01 | Mars | FALSE | 1759 | 920 | 839 | 0.5230244 |
| 4082_01 | Mars | FALSE | 1759 | 920 | 839 | 0.5230244 |
| 4078_01 | NA | FALSE | 201 | 103 | 98 | 0.5124378 |
| 6623_01 | NA | TRUE | 201 | 103 | 98 | 0.5124378 |
| 6954_02 | NA | FALSE | 201 | 103 | 98 | 0.5124378 |
This operation simplifies the new data
Planet_Odds_Simplified <- Planet_Odds %>%
ungroup() %>%
select(HomePlanet, Planet_odds) %>%
group_by(HomePlanet) %>%
summarise(Planet_odds = mean(Planet_odds, trim = 3)) %>%
mutate(across(Planet_odds, round, 3))
kable(Planet_Odds_Simplified,
col.names = c("Home Planet", "Odds"),
align = "cc")
| Home Planet | Odds |
|---|---|
| Earth | 0.424 |
| Europa | 0.659 |
| Mars | 0.523 |
| NA | 0.512 |
# do the same to add the new column to the original
train_df <- train_df %>%
group_by(HomePlanet) %>%
mutate(Planet_Total = n(),
Planet_Transported = sum(Transported == TRUE),
Planet_Remained = sum(Transported == FALSE)) %>%
mutate(Planet_odds = Planet_Transported / Planet_Total)
ggplot(Planet_Odds_Simplified, aes(HomePlanet, Planet_odds))+
geom_col(aes(fill = HomePlanet, show.legend = FALSE)) +
geom_label(aes(label = Planet_odds, vjust = 1.5))+
labs(title = "Odds of Teleportation by Home Planet",
x = "Home Planet",
y = "Odds of Teleportation")+
theme(axis.text.y = element_text(size = 1.5))
Europans are more likely to be teleported
Deck_Odds <- train_df %>%
group_by(Deck) %>%
mutate(Deck_Total = n(),
Deck_Transported = sum(Transported == TRUE),
Deck_Remained = sum(Transported == FALSE)) %>%
ungroup() %>%
select(PassengerId, Deck, Deck_Total, Transported, Deck_Transported, Deck_Remained) %>%
group_by(Deck) %>%
mutate(Deck_odds = Deck_Transported / Deck_Total)
Deck_Odds %>%
slice_sample(n = 1) %>%
kable(align = "ccccccc")
| PassengerId | Deck | Deck_Total | Transported | Deck_Transported | Deck_Remained | Deck_odds |
|---|---|---|---|---|---|---|
| 6827_02 | A | 256 | TRUE | 127 | 129 | 0.4960938 |
| 6590_01 | B | 779 | FALSE | 572 | 207 | 0.7342747 |
| 1452_04 | C | 747 | FALSE | 508 | 239 | 0.6800535 |
| 1357_01 | D | 478 | TRUE | 207 | 271 | 0.4330544 |
| 1316_02 | E | 876 | FALSE | 313 | 563 | 0.3573059 |
| 4145_01 | F | 2794 | FALSE | 1229 | 1565 | 0.4398712 |
| 1615_03 | G | 2559 | TRUE | 1321 | 1238 | 0.5162173 |
| 1071_01 | T | 5 | FALSE | 1 | 4 | 0.2000000 |
| 0310_01 | NA | 199 | FALSE | 100 | 99 | 0.5025126 |
# do the same to add the new column to the original
train_df <- train_df %>%
group_by(Deck) %>%
mutate(Deck_Total = n(),
Deck_Transported = sum(Transported == TRUE),
Deck_Remained = sum(Transported == FALSE)) %>%
mutate(Deck_odds = Deck_Transported / Deck_Total)
This operation simplifies the data
Deck_Odds_Simplified <- Deck_Odds %>%
ungroup() %>%
select(Deck, Deck_odds) %>%
group_by(Deck) %>%
summarise(Deck_odds = mean(Deck_odds, trim = 3)) %>%
mutate(across(Deck_odds, round, 3))
kable(Deck_Odds_Simplified,
align = "cc")
| Deck | Deck_odds |
|---|---|
| A | 0.496 |
| B | 0.734 |
| C | 0.680 |
| D | 0.433 |
| E | 0.357 |
| F | 0.440 |
| G | 0.516 |
| T | 0.200 |
| NA | 0.503 |
ggplot(Deck_Odds_Simplified, aes(Deck, Deck_odds))+
geom_col(aes(fill = Deck, show.legend = FALSE)) +
geom_label(aes(label = Deck_odds, vjust = 1.5))+
labs(title = "Odds of Teleportation by Deck Assignment",
x = "Deck Assignment",
y = "Odds of Teleportation")+
theme(axis.text.y = element_text(size = 1.5))
Those assigned to decks B, C, and G are most likely to be teleported
CabinSide_Odds <- train_df %>%
group_by(CabinSide) %>%
mutate(CabinSide_Total = n(),
CabinSide_Transported = sum(Transported == TRUE),
CabinSide_Remained = sum(Transported == FALSE)) %>%
ungroup() %>%
select(PassengerId, CabinSide, Transported, CabinSide_Total, CabinSide_Transported, CabinSide_Remained) %>%
group_by(CabinSide) %>%
mutate(CabinSide_odds = CabinSide_Transported / CabinSide_Total)
CabinSide_Odds %>%
slice_sample(n = 3) %>%
kable(col.names = c("PassengerID", "CabinSide", "Transported", "Total", "Teleported", "Remained", "Odds"),
align = "ccccccc")
| PassengerID | CabinSide | Transported | Total | Teleported | Remained | Odds |
|---|---|---|---|---|---|---|
| 4336_03 | TRUE | 199 | 100 | 99 | 0.5025126 | |
| 8770_03 | FALSE | 199 | 100 | 99 | 0.5025126 | |
| 1718_01 | TRUE | 199 | 100 | 99 | 0.5025126 | |
| 7679_01 | P | FALSE | 4206 | 1898 | 2308 | 0.4512601 |
| 9179_02 | P | FALSE | 4206 | 1898 | 2308 | 0.4512601 |
| 6976_01 | P | TRUE | 4206 | 1898 | 2308 | 0.4512601 |
| 0728_01 | S | TRUE | 4288 | 2380 | 1908 | 0.5550373 |
| 3542_01 | S | TRUE | 4288 | 2380 | 1908 | 0.5550373 |
| 0762_01 | S | FALSE | 4288 | 2380 | 1908 | 0.5550373 |
train_df <- train_df %>%
group_by(CabinSide) %>%
mutate(CabinSide_Total = n(),
CabinSide_Transported = sum(Transported == TRUE),
CabinSide_Remained = sum(Transported == FALSE)) %>%
mutate(CabinSide_odds = CabinSide_Transported / CabinSide_Total)
CabinSide_Odds_Simplified <- CabinSide_Odds %>%
ungroup() %>%
select(CabinSide, CabinSide_odds) %>%
group_by(CabinSide) %>%
summarise(CabinSide_odds = mean(CabinSide_odds, trim = 3)) %>%
mutate(across(CabinSide_odds, round, 3))
CabinSide_Odds_Simplified[1,1] <- NA
kable(CabinSide_Odds_Simplified,
col.names = c("CabinSide", "Odds"),
align = "cc")
| CabinSide | Odds |
|---|---|
| NA | 0.503 |
| P | 0.451 |
| S | 0.555 |
ggplot(CabinSide_Odds_Simplified, aes(CabinSide, CabinSide_odds))+
geom_col(aes(fill = CabinSide, show.legend = FALSE)) +
geom_label(aes(label = CabinSide_odds, vjust = 1.5))+
labs(title = "Odds of Teleportation by Ship Side Assignment",
x = "Ship Side Assignment",
y = "Odds of Teleportation")+
theme(axis.text.y = element_text(size = 1.5))
Passengers assigned to the starboard side are more likely to be teleported
VIP_Odds <- train_df %>%
group_by(VIP) %>%
mutate(VIP_Total = n(),
VIP_Transported = sum(Transported == TRUE),
VIP_Remained = sum(Transported == FALSE)) %>%
ungroup() %>%
select(PassengerId, VIP, Transported, VIP_Total, VIP_Transported, VIP_Remained) %>%
group_by(VIP) %>%
mutate(VIP_odds = VIP_Transported / VIP_Total)
VIP_Odds %>%
slice_sample(n = 3) %>%
kable(align = "ccccccc",
col.names = c("PassengerID", "VIP", "Transported", "Total", "Teleported", "Remained", "Odds"))
| PassengerID | VIP | Transported | Total | Teleported | Remained | Odds |
|---|---|---|---|---|---|---|
| 6599_01 | FALSE | TRUE | 8291 | 4198 | 4093 | 0.5063322 |
| 2683_01 | FALSE | TRUE | 8291 | 4198 | 4093 | 0.5063322 |
| 8570_01 | FALSE | FALSE | 8291 | 4198 | 4093 | 0.5063322 |
| 1816_01 | TRUE | TRUE | 199 | 76 | 123 | 0.3819095 |
| 1742_01 | TRUE | FALSE | 199 | 76 | 123 | 0.3819095 |
| 4543_01 | TRUE | FALSE | 199 | 76 | 123 | 0.3819095 |
| 8948_01 | NA | FALSE | 203 | 104 | 99 | 0.5123153 |
| 1509_01 | NA | FALSE | 203 | 104 | 99 | 0.5123153 |
| 8710_01 | NA | FALSE | 203 | 104 | 99 | 0.5123153 |
train_df <- train_df %>%
group_by(VIP) %>%
mutate(VIP_Total = n(),
VIP_Transported = sum(Transported == TRUE),
VIP_Remained = sum(Transported == FALSE)) %>%
mutate(VIP_odds = VIP_Transported / VIP_Total)
VIP_Odds_Simplified <- VIP_Odds %>%
ungroup() %>%
select(VIP, VIP_odds) %>%
group_by(VIP) %>%
summarise(VIP_odds = mean(VIP_odds, trim = 3)) %>%
mutate(across(VIP_odds, round, 3))
kable(VIP_Odds_Simplified,
align = "cc")
| VIP | VIP_odds |
|---|---|
| FALSE | 0.506 |
| TRUE | 0.382 |
| NA | 0.512 |
ggplot(VIP_Odds_Simplified, aes(VIP, VIP_odds))+
geom_col(aes(fill = VIP, show.legend = FALSE)) +
geom_label(aes(label = VIP_odds, vjust = 1.5))+
labs(title = "Odds of Teleportation by VIP Status",
x = "VIP Status",
y = "Odds of Teleportation")+
theme(axis.text.y = element_text(size = 1.5))
The peasants are more likely to be teleported, obviously.
AgeBins_Odds <- train_df %>%
group_by(AgeBins) %>%
mutate(AgeBins_Total = n(),
AgeBins_Transported = sum(Transported == TRUE),
AgeBins_Remained = sum(Transported == FALSE)) %>%
ungroup() %>%
select(PassengerId, Age, AgeBins, Transported, AgeBins_Total, AgeBins_Transported, AgeBins_Remained) %>%
group_by(AgeBins) %>%
mutate(AgeBins_odds = AgeBins_Transported / AgeBins_Total)
AgeBins_Odds %>%
slice_sample(n = 1) %>%
kable(align = "cccccccc",
col.names = c("PassengerID","Age", "AgeBin", "Transported", "Total", "Teleported", "Remained", "Odds"))
| PassengerID | Age | AgeBin | Transported | Total | Teleported | Remained | Odds |
|---|---|---|---|---|---|---|---|
| 8226_01 | 3 | 0-9 | TRUE | 501 | 339 | 162 | 0.6766467 |
| 0598_01 | 19 | 10-19 | FALSE | 1479 | 788 | 691 | 0.5327924 |
| 5758_01 | 22 | 20-29 | TRUE | 2762 | 1296 | 1466 | 0.4692252 |
| 4017_02 | 30 | 30-39 | TRUE | 1735 | 796 | 939 | 0.4587896 |
| 8032_02 | 40 | 40-49 | TRUE | 1048 | 531 | 517 | 0.5066794 |
| 3382_02 | 56 | 50-59 | TRUE | 557 | 275 | 282 | 0.4937163 |
| 1238_02 | 61 | 60-69 | TRUE | 208 | 99 | 109 | 0.4759615 |
| 3831_01 | 73 | 70-79 | FALSE | 46 | 20 | 26 | 0.4347826 |
| 5714_03 | 0 | NA | TRUE | 357 | 234 | 123 | 0.6554622 |
train_df <- train_df %>%
group_by(VIP) %>%
mutate(AgeBins_Total = n(),
AgeBins_Transported = sum(Transported == TRUE),
AgeBins_Remained = sum(Transported == FALSE)) %>%
mutate(AgeBins_odds = AgeBins_Transported / AgeBins_Total)
AgeBins_Odds_Simplified <- AgeBins_Odds %>%
ungroup() %>%
select(AgeBins, AgeBins_odds) %>%
group_by(AgeBins) %>%
summarise(AgeBins_odds = mean(AgeBins_odds, trim = 3)) %>%
mutate(across(AgeBins_odds, round, 3))
kable(AgeBins_Odds_Simplified,
align = "cc",
col.names = c("AgeBins", "Odds"))
| AgeBins | Odds |
|---|---|
| 0-9 | 0.677 |
| 10-19 | 0.533 |
| 20-29 | 0.469 |
| 30-39 | 0.459 |
| 40-49 | 0.507 |
| 50-59 | 0.494 |
| 60-69 | 0.476 |
| 70-79 | 0.435 |
| NA | 0.655 |
ggplot(AgeBins_Odds_Simplified, aes(AgeBins, AgeBins_odds))+
geom_col(aes(fill = AgeBins, show.legend = FALSE)) +
geom_label(aes(label = AgeBins_odds, vjust = 1.5))+
labs(title = "Odds of Teleportation by Age",
x = "Age",
y = "Odds of Teleportation")+
theme(axis.text.y = element_text(size = 1.5))
Children, teenagers, and the adults who act like them are more likely to be teleported.
Destination_Odds <- train_df %>%
group_by(Destination) %>%
mutate(Destination_Total = n(),
Destination_Transported = sum(Transported == TRUE),
Destination_Remained = sum(Transported == FALSE)) %>%
ungroup() %>%
select(PassengerId, Destination, Transported, Destination_Total, Destination_Transported, Destination_Remained) %>%
group_by(Destination) %>%
mutate(Destination_odds = Destination_Transported / Destination_Total)
Destination_Odds %>%
slice_sample(n = 3) %>%
kable(align = "ccccccc",
col.names = c("PassengerID", "Destination", "Transported", "Total", "Teleported", "Remained", "Odds"))
| PassengerID | Destination | Transported | Total | Teleported | Remained | Odds |
|---|---|---|---|---|---|---|
| 6744_03 | 55 Cancri e | FALSE | 1800 | 1098 | 702 | 0.6100000 |
| 8004_01 | 55 Cancri e | TRUE | 1800 | 1098 | 702 | 0.6100000 |
| 0221_01 | 55 Cancri e | FALSE | 1800 | 1098 | 702 | 0.6100000 |
| 0781_01 | PSO J318.5-22 | FALSE | 796 | 401 | 395 | 0.5037688 |
| 6120_01 | PSO J318.5-22 | TRUE | 796 | 401 | 395 | 0.5037688 |
| 5108_01 | PSO J318.5-22 | TRUE | 796 | 401 | 395 | 0.5037688 |
| 7020_01 | TRAPPIST-1e | TRUE | 5915 | 2787 | 3128 | 0.4711750 |
| 2558_01 | TRAPPIST-1e | TRUE | 5915 | 2787 | 3128 | 0.4711750 |
| 3103_01 | TRAPPIST-1e | FALSE | 5915 | 2787 | 3128 | 0.4711750 |
| 6889_01 | NA | FALSE | 182 | 92 | 90 | 0.5054945 |
| 0504_03 | NA | TRUE | 182 | 92 | 90 | 0.5054945 |
| 5859_01 | NA | TRUE | 182 | 92 | 90 | 0.5054945 |
# do the same to add the new column to the original
train_df <- train_df %>%
group_by(Destination) %>%
mutate(Destination_Total = n(),
Destination_Transported = sum(Transported == TRUE),
Destination_Remained = sum(Transported == FALSE)) %>%
mutate(Destination_odds = Destination_Transported / Destination_Total)
Destination_Odds_Simplified <- Destination_Odds %>%
ungroup() %>%
select(Destination, Destination_odds) %>%
group_by(Destination) %>%
summarise(Destination_odds = mean(Destination_odds, trim = 3)) %>%
mutate(across(Destination_odds, round, 3))
kable(Destination_Odds_Simplified,
align = "cc")
| Destination | Destination_odds |
|---|---|
| 55 Cancri e | 0.610 |
| PSO J318.5-22 | 0.504 |
| TRAPPIST-1e | 0.471 |
| NA | 0.505 |
ggplot(Destination_Odds_Simplified, aes(Destination, Destination_odds))+
geom_col(aes(fill = Destination, show.legend = FALSE)) +
geom_label(aes(label = Destination_odds, vjust = 1.5))+
labs(title = "Odds of Teleportation by Destination",
x = "Destination",
y = "Odds of Teleportation")+
theme(axis.text.y = element_text(size = 1.5))
Passengers bound for 55 Cancri e and PSO J318.5-22 are more likely to be teleported
This operation bins the total debt figures into manageable chunks
train_df$DebtBins <- cut(train_df$Total_Debt, breaks = c(-1, 1000, 5000, 10000, 20000, 30000, 100000), labels = c("0-1000", "1-5000", "5-10000", "5-20000", "20-30000", "30000+"))
test_df$DebtBins <- cut(test_df$Total_Debt, breaks = c(-1, 1000, 5000, 10000, 20000, 30000, 100000),labels = c("0-1000", "1-5000", "5-10000", "5-20000", "20-30000", "30000+"))
this operation creates a df just for binned debt totals and their odds
DebtBins_Odds <- train_df %>%
group_by(DebtBins) %>%
mutate(DebtBins_Total = n(),
DebtBins_Transported = sum(Transported == TRUE),
DebtBins_Remained = sum(Transported == FALSE)) %>%
ungroup() %>%
select(PassengerId, DebtBins, Transported, DebtBins_Total, DebtBins_Transported, DebtBins_Remained) %>%
group_by(DebtBins) %>%
mutate(DebtBins_odds = DebtBins_Transported / DebtBins_Total)
DebtBins_Odds %>%
slice_sample(n = 1) %>%
kable(align = "ccccccc",
col.names = c("PassengerID", "DebtBin", "Transported", "Total", "Teleported", "Remained", "Odds"))
| PassengerID | DebtBin | Transported | Total | Teleported | Remained | Odds |
|---|---|---|---|---|---|---|
| 4933_04 | 0-1000 | TRUE | 5860 | 3520 | 2340 | 0.6006826 |
| 1561_01 | 1-5000 | FALSE | 2196 | 653 | 1543 | 0.2973588 |
| 7833_01 | 5-10000 | FALSE | 456 | 150 | 306 | 0.3289474 |
| 2589_03 | 5-20000 | FALSE | 150 | 48 | 102 | 0.3200000 |
| 8970_01 | 20-30000 | FALSE | 27 | 5 | 22 | 0.1851852 |
| 8556_01 | 30000+ | FALSE | 4 | 2 | 2 | 0.5000000 |
train_df <- train_df %>%
group_by(DebtBins) %>%
mutate(DebtBins_Total = n(),
DebtBins_Transported = sum(Transported == TRUE),
DebtBins_Remained = sum(Transported == FALSE)) %>%
mutate(DebtBins_odds = DebtBins_Transported / DebtBins_Total)
DebtBins_Odds_Simplified <- DebtBins_Odds %>%
ungroup() %>%
select(DebtBins, DebtBins_odds) %>%
group_by(DebtBins) %>%
summarise(DebtBins_odds = mean(DebtBins_odds, trim = 3)) %>%
mutate(across(DebtBins_odds, round, 3))
kable(DebtBins_Odds_Simplified,
align = "cc",
col.names = c("DebBin", "Odds"))
| DebBin | Odds |
|---|---|
| 0-1000 | 0.601 |
| 1-5000 | 0.297 |
| 5-10000 | 0.329 |
| 5-20000 | 0.320 |
| 20-30000 | 0.185 |
| 30000+ | 0.500 |
ggplot(DebtBins_Odds_Simplified, aes(DebtBins, DebtBins_odds))+
geom_col(aes(fill = DebtBins, show.legend = FALSE)) +
geom_label(aes(label = DebtBins_odds, vjust = 1.5))+
labs(title = "Odds of Teleportation by Debt Amount",
x = "Debt Amount",
y = "Odds of Teleportation")+
theme(axis.text.y = element_text(size = 1.5))
Those who spent the least and those who spent the most were more likely to be teleported
this operation reorganzied the data for consumption
train_df <- train_df %>%
select(PassengerId:HomePlanet, Planet_Total:Planet_odds, CryoSleep:Cryo_odds, Cabin:CabinSide, Deck_Total:CabinSide_odds, Destination, Destination_Total:Destination_odds, Age:AgeBins, AgeBins_Total:AgeBins_odds, VIP, VIP_Total:VIP_odds, RoomService:Total_Debt, DebtBins:DebtBins_odds)
This operation gives a snapshot of the range of odds
train_total_odds <- train_df %>%
group_by(PassengerId) %>%
mutate(Odds = (Planet_odds * Cryo_odds * Deck_odds *CabinSide_odds * Destination_odds * AgeBins_odds * VIP_odds *DebtBins_odds)) %>%
select(PassengerId, Odds, Transported, GroupNum:Name, HomePlanet:DebtBins_odds)
summary(train_total_odds$Odds)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0005464 0.0020066 0.0028990 0.0047252 0.0062640 0.0211128
To actually be Teleported, a passenger has to have at least a 0.00069 chance to be teleported at all
Train_Mean_Min_Max_Odds <- train_total_odds %>%
group_by(Transported) %>%
summarise(MeanOdds = mean(Odds),
MaxOdds = max(Odds),
MinOdds = min(Odds),
SDOdds = sd(Odds))
kable(Train_Mean_Min_Max_Odds,
align = "ccccc")
| Transported | MeanOdds | MaxOdds | MinOdds | SDOdds |
|---|---|---|---|---|
| FALSE | 0.0027124 | 0.0139331 | 0.0005464 | 0.0017756 |
| TRUE | 0.0067090 | 0.0211128 | 0.0006870 | 0.0049713 |
AgeBins Function this function will assign odds based on a passenger’s age range
age_bins_function <- function(x) {
case_when(x == "0-9" ~ 0.677,
x == "10-19" ~ 0.533,
x == "20-29" ~ 0.469,
x == "30-39" ~ 0.459,
x == "40-49" ~ 0.507,
x == "50-59" ~ 0.494,
x == "60-69" ~ 0.476,
x == "70-79" ~ 0.435,
x == "NA" ~ 0.655)
}
this operation adds the vector created by the function into the test dataframe
test_age_bins <- as.vector(lapply(test_df$AgeBins, age_bins_function))
# add it as a column to test_df
test_df$AgeBinsOdds <- test_age_bins
CabinSide Function
CabinSide_function <- function(x) {
case_when(x == NA ~ 0.503,
x == "P" ~ 0.451,
x == "S" ~ 0.555)
}
test_CabinSide <- as.vector(lapply(test_df$CabinSide, CabinSide_function))
# add it as a column to test_df
test_df$CabinSideOdds <- test_CabinSide
# for some reason, the newly created columns have to be assigned as doubles after being created
test_df <- test_df %>%
mutate(CabinSideOdds = as.double(CabinSideOdds),
AgeBinsOdds = as.double(AgeBinsOdds))
CryoSleep Function
CryoSleep_function <- function(x) {
case_when(x == "FALSE" ~ 0.329,
x == "TRUE" ~ 0.451,
x == NA ~ 0.488)
}
test_CryoSleep <- as.vector(lapply(test_df$CryoSleep, CryoSleep_function))
# add it as a column to test_df
test_df$CryoSleepOdds <- test_CryoSleep
# For some reason all these columns must be transformed back into doubles
test_df <- test_df %>%
mutate(CabinSideOdds = as.double(CabinSideOdds),
AgeBinsOdds = as.double(AgeBinsOdds),
CryoSleepOdds = as.double(CryoSleepOdds))
DebtBins Function
DebtBins_function <- function(x) {
case_when(x == "0-1000" ~ 0.601,
x == "1-5000" ~ 0.297,
x == "5-10000" ~ 0.329,
x == "5-2000" ~ 0.320,
x == "20-30000" ~ 0.185,
x == "30000+" ~ 0.500)
}
test_DebtBins <- as.vector(lapply(test_df$DebtBins, DebtBins_function))
# add it as a column to test_df
test_df$DebtBinsOdds <- test_DebtBins
# For some reason all these columns must be transformed back into doubles
test_df <- test_df %>%
mutate(CabinSideOdds = as.double(CabinSideOdds),
AgeBinsOdds = as.double(AgeBinsOdds),
CryoSleepOdds = as.double(CryoSleepOdds),
DebtBinsOdds = as.double(DebtBinsOdds))
Destination Function
Destination_function <- function(x) {
case_when(x == "55 Cancri e" ~ 0.610,
x == "PSO J318.5-22" ~ 0.504,
x == "TRAPPIST-1e" ~ 0.471,
x == NA ~ 0.505)
}
test_Destination <- as.vector(lapply(test_df$Destination, Destination_function))
# add it as a column to test_df
test_df$DestinationOdds <- test_Destination
# For some reason all these columns must be transformed back into doubles
test_df <- test_df %>%
mutate(CabinSideOdds = as.double(CabinSideOdds),
AgeBinsOdds = as.double(AgeBinsOdds),
CryoSleepOdds = as.double(CryoSleepOdds),
DebtBinsOdds = as.double(DebtBinsOdds),
DestinationOdds = as.double(DestinationOdds))
HomePlanet Function
HomePlanet_function <- function(x) {
case_when(x == "Earth" ~ 0.424,
x == "Europa" ~ 0.659,
x == "Mars" ~ 0.523,
x == NA ~ 0.512)
}
test_HomePlanet <- as.vector(lapply(test_df$HomePlanet, HomePlanet_function))
# add it as a column to test_df
test_df$HomePlanetOdds <- test_HomePlanet
# For some reason all these columns must be transformed back into doubles
test_df <- test_df %>%
mutate(CabinSideOdds = as.double(CabinSideOdds),
AgeBinsOdds = as.double(AgeBinsOdds),
CryoSleepOdds = as.double(CryoSleepOdds),
DebtBinsOdds = as.double(DebtBinsOdds),
DestinationOdds = as.double(DestinationOdds),
HomePlanetOdds = as.double(HomePlanetOdds))
Deck Function
Deck_function <- function(x) {
case_when(x == "A" ~ 0.496,
x == "B" ~ 0.734,
x == "C" ~ 0.680,
x == "D" ~ 0.433,
x == "E" ~ 0.357,
x == "F" ~ 0.440,
x == "G" ~ 0.516,
x == "T" ~ 0.200,
x == NA ~ 0.503)
}
test_Deck <- as.vector(lapply(test_df$Deck, Deck_function))
# add it as a column to test_df
test_df$DeckOdds <- test_Deck
# For some reason all these columns must be transformed back into doubles
test_df <- test_df %>%
mutate(CabinSideOdds = as.double(CabinSideOdds),
AgeBinsOdds = as.double(AgeBinsOdds),
CryoSleepOdds = as.double(CryoSleepOdds),
DebtBinsOdds = as.double(DebtBinsOdds),
DestinationOdds = as.double(DestinationOdds),
HomePlanetOdds = as.double(HomePlanetOdds),
DeckOdds = as.double(DeckOdds))
VIP function
VIP_function <- function(x) {
case_when(x == FALSE ~ 0.506,
x == TRUE ~ 0.382,
x == NA ~ 0.512)
}
test_VIP <- as.vector(lapply(test_df$VIP, VIP_function))
# add it as a column to test_df
test_df$VIPOdds <- unlist(test_VIP)
# For some reason all these columns must be transformed back into doubles
test_df <- test_df %>%
mutate(CabinSideOdds = as.double(CabinSideOdds),
AgeBinsOdds = as.double(AgeBinsOdds),
CryoSleepOdds = as.double(CryoSleepOdds),
DebtBinsOdds = as.double(DebtBinsOdds),
DestinationOdds = as.double(DestinationOdds),
HomePlanetOdds = as.double(HomePlanetOdds),
DeckOdds = as.double(DeckOdds),
VIPOdds = as.double(VIPOdds))
# this operation reorganized the data in the test df
test_df <- test_df %>%
select(PassengerId:HomePlanet, HomePlanetOdds, CryoSleep, CryoSleepOdds, Cabin, CabinNumber, Deck, DeckOdds, CabinSide, CabinSideOdds, Destination, DestinationOdds, Age:AgeBins, AgeBinsOdds, VIP, VIPOdds, RoomService:DebtBins, DebtBinsOdds)
To be teleported at all, a passenger’s odds must be at least 0.0007
test_total_odds <- test_df %>%
ungroup() %>%
group_by(PassengerId) %>%
mutate(Odds = (HomePlanetOdds * CryoSleepOdds * DeckOdds *CabinSideOdds * DestinationOdds * AgeBinsOdds * VIPOdds * DebtBinsOdds))
summary(test_total_odds$Odds)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.0007 0.0017 0.0027 0.0032 0.0039 0.0120 659
Using the train_df data to determine what percentage of those with particular odds were actually teleported. This figure will be applied to those from the test df
train_total_odds %>%
filter(Odds >= 0.0005 & Odds <= 0.001) %>%
ggplot(aes(x = Transported, fill = Transported))+
geom_bar(stat = "count")+
stat_count(geom = "label", color = "white", size = 3.5,
aes(label = ..count..), position = position_dodge(width = 0.9), show.legend = FALSE)+
labs(title = "Total Teleported with Odds between 0.0005 and 0.001",
x = "Teleportation Status",
y = "Total Teleported")+
theme(axis.text.y = element_text(size = 1.5))
Approximatley 21% chance of teleportation
train_total_odds %>%
filter(Odds > 0.001 & Odds <= 0.005) %>%
ggplot(aes(x = Transported, fill = Transported))+
geom_bar(stat = "count")+
stat_count(geom = "label", color = "white", size = 3.5,
aes(label = ..count..), position = position_dodge(width = 0.9), show.legend = FALSE)+
labs(title = "Total Teleported with Odds between 0.001 and 0.005",
x = "Teleportation Status",
y = "Total Teleported")+
theme(axis.text.y = element_text(size = 1.5))
Approximately 34% chance of teleportation
train_total_odds %>%
filter(Odds > 0.005 & Odds <= 0.01) %>%
ggplot(aes(x = Transported, fill = Transported))+
geom_bar(stat = "count")+
stat_count(geom = "label", color = "white", size = 3.5,
aes(label = ..count..), position = position_dodge(width = 0.9), show.legend = FALSE)+
labs(title = "Total Teleported with Odds between 0.005 and 0.01",
x = "Teleportation Status",
y = "Total Teleported")+
theme(axis.text.y = element_text(size = 1.5))
Approximately 76% chance of teleportation
train_total_odds %>%
filter(Odds > 0.01 & Odds <= 0.05) %>%
ggplot(aes(x = Transported, fill = Transported))+
geom_bar(stat = "count")+
stat_count(geom = "label", color = "white", size = 3.5,
aes(label = ..count..), position = position_dodge(width = 0.9), show.legend = FALSE)+
labs(title = "Total Teleported with Odds between 0.01 and 0.05",
x = "Teleportation Status",
y = "Total Teleported")+
theme(axis.text.y = element_text(size = 1.5))
Approximately 99% chance of teleportation
This operation creates a function that will give a vector of liklihoods based on a passenger’s overall odds of teleportation
likelihood_function <- function(x) {
case_when((x >- 0.0005 & x <= 0.001) ~ 21,
(x > 0.001 & x <= 0.005) ~ 34,
(x > 0.005 & x <= 0.01) ~ 76,
(x > 0.01 & x <= 0.05) ~ 99)
}
test_Estimates <- as.vector(lapply(test_total_odds$Odds, likelihood_function))
# add it as a column to test_df
test_total_odds$Percent_Likelihood <- test_Estimates
# reorder columns to make checking easier
test_total_odds <- test_total_odds %>%
select(PassengerId, Odds, Percent_Likelihood, GroupNum:DebtBinsOdds)
# For some reason all these columns must be transformed back into doubles
test_total_odds <- test_total_odds %>%
mutate(Percent_Likelihood = as.double(Percent_Likelihood))
This operation simplifies for consumption
FINAL <- test_total_odds
passenger_likelihood <- FINAL %>%
select(PassengerId, Name, Percent_Likelihood)
passenger_likelihood %>%
slice_sample(n = 10) %>%
head(n = 12) %>%
kable(align = "ccc")
| PassengerId | Name | Percent_Likelihood |
|---|---|---|
| 0013_01 | Nelly Carsoning | 34 |
| 0018_01 | Lerome Peckers | 34 |
| 0019_01 | Sabih Unhearfus | 76 |
| 0021_01 | Meratz Caltilter | 34 |
| 0023_01 | Brence Harperez | 34 |
| 0027_01 | Karlen Ricks | 21 |
| 0029_01 | Aldah Ainserfle | 76 |
| 0032_01 | Acrabi Pringry | 34 |
| 0032_02 | Dhena Pringry | 76 |
| 0033_01 | Eliana Delazarson | 34 |
| 0037_01 | Vivia Rickson | 34 |
| 0040_01 | Antino Pinoffent | 34 |
sample visualiztion of simplified data
passenger_likelihood %>%
mutate(Percent_Likelihood = as.factor(Percent_Likelihood)) %>%
ggplot(aes(Percent_Likelihood, fill = Percent_Likelihood))+
geom_bar()+
stat_count(geom = "text", color = "light blue", size = 3.5, aes(label = ..count..), position = position_stack(vjust = 0.5))+
labs(title = "Total Possible Teleported Passengers by Likelihood",
x = "Percent Likelihood",
y = "Total Possible Teleported")+
theme(axis.text.y = element_text(size = 1.5))