The Titanic dataset provides 11 predictors and 1 binary response variable: Survived. The ultimate goal is to predict survival based on characteristics of the passenger.
In this Exploratory Data Analysis, I will examine the relationships between the predictors and the response variable, use feature engineering to create new predictors that may contribute to the analysis, and handle missing values and outliers as appropriate. The output of this analysis will be a preprocessed dataset, ready for machine learning algorithms.
library(dplyr)
library(ggplot2)
library(tidyverse)
library(RColorBrewer)
library(scales)
library(ggpubr)
train <- read_csv('train.csv')
test <- read_csv('test.csv') #test lacks the Survived column
test$Survived <- NA
original <- rbind(train,test)
titanic <- original
Let’s begin by getting a glimpse at the data:
head(titanic)
## # A tibble: 6 x 12
## PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin
## <dbl> <dbl> <dbl> <chr> <chr> <dbl> <dbl> <dbl> <chr> <dbl> <chr>
## 1 1 0 3 Brau~ male 22 1 0 A/5 2~ 7.25 <NA>
## 2 2 1 1 Cumi~ fema~ 38 1 0 PC 17~ 71.3 C85
## 3 3 1 3 Heik~ fema~ 26 0 0 STON/~ 7.92 <NA>
## 4 4 1 1 Futr~ fema~ 35 1 0 113803 53.1 C123
## 5 5 0 3 Alle~ male 35 0 0 373450 8.05 <NA>
## 6 6 0 3 Mora~ male NA 0 0 330877 8.46 <NA>
## # ... with 1 more variable: Embarked <chr>
summary(titanic)
## PassengerId Survived Pclass Name
## Min. : 1 Min. :0.0000 Min. :1.000 Length:1309
## 1st Qu.: 328 1st Qu.:0.0000 1st Qu.:2.000 Class :character
## Median : 655 Median :0.0000 Median :3.000 Mode :character
## Mean : 655 Mean :0.3838 Mean :2.295
## 3rd Qu.: 982 3rd Qu.:1.0000 3rd Qu.:3.000
## Max. :1309 Max. :1.0000 Max. :3.000
## NA's :418
## Sex Age SibSp Parch
## Length:1309 Min. : 0.17 Min. :0.0000 Min. :0.000
## Class :character 1st Qu.:21.00 1st Qu.:0.0000 1st Qu.:0.000
## Mode :character Median :28.00 Median :0.0000 Median :0.000
## Mean :29.88 Mean :0.4989 Mean :0.385
## 3rd Qu.:39.00 3rd Qu.:1.0000 3rd Qu.:0.000
## Max. :80.00 Max. :8.0000 Max. :9.000
## NA's :263
## Ticket Fare Cabin Embarked
## Length:1309 Min. : 0.000 Length:1309 Length:1309
## Class :character 1st Qu.: 7.896 Class :character Class :character
## Mode :character Median : 14.454 Mode :character Mode :character
## Mean : 33.295
## 3rd Qu.: 31.275
## Max. :512.329
## NA's :1
We see missing data for a few features. Let’s count the missing values in each column:
#convert these variables to factors, easier to plot
factors <- c('Survived', 'Pclass', 'Sex', 'Cabin', 'Embarked')
titanic[factors] <- lapply(titanic[factors], function(x) as.factor(x))
#missing values to keep in mind
titanic %>%
summarize_all(~(sum(is.na(.))))
## # A tibble: 1 x 12
## PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin
## <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
## 1 0 418 0 0 0 263 0 0 0 1 1014
## # ... with 1 more variable: Embarked <int>
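As a cross-check, the same per-column count can be sketched in base R. This is a minimal illustration on a made-up toy table (`toy`, `na_counts` and `na_share` are names I’m introducing here, not part of the analysis above):

```r
# Toy data standing in for the combined train/test table (values are made up)
toy <- data.frame(
  Age   = c(22, NA, 26, NA),
  Fare  = c(7.25, 71.3, NA, 8.05),
  Cabin = c(NA, "C85", NA, NA),
  stringsAsFactors = FALSE
)

# is.na() returns a logical matrix; column sums count the TRUEs per column
na_counts <- colSums(is.na(toy))

# Share of missing values per column, in percent
na_share <- round(100 * na_counts / nrow(toy))

print(rbind(count = na_counts, percent = na_share))
```

Seeing the percentage next to the count makes it easier to judge whether a column is worth imputing.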
With this in mind, let’s begin our variable exploration!
PassengerId - unique to each passenger
Survived - dependent variable (target)
Pclass - passenger class, with three levels: 1, 2, 3
Q: How is Pclass distributed onboard?
titanic %>%
count(Pclass) %>%
mutate(proportion = round(n/sum(n)*100))
## # A tibble: 3 x 3
## Pclass n proportion
## <fct> <int> <dbl>
## 1 1 323 25
## 2 2 277 21
## 3 3 709 54
Q: Are there more men or women in each Pclass?
titanic %>%
ggplot(aes(x=Pclass, fill=Sex)) +
geom_bar(position = "dodge", width = 0.4)
Q: What was the survival rate by Pclass? Were those who paid more prioritized for rescue?
titanic %>%
filter(!is.na(Survived)) %>%
ggplot(aes(x=Pclass, fill=Survived))+
geom_bar(position='dodge')+
scale_fill_brewer(palette = 'Set1')
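The bar chart shows counts; a survival rate per class can be read off more directly as a proportion. A minimal sketch on made-up data, assuming Survived is stored as numeric 0/1 (in the analysis above it is a factor, so it would need converting first):

```r
# Made-up passengers; NA marks a test-set row with unknown outcome
toy <- data.frame(
  Pclass   = c(1, 1, 1, 2, 2, 3, 3, 3, 3, 3),
  Survived = c(1, 1, 0, 1, 0, 0, 0, 1, 0, NA)
)

# Keep only rows with a known outcome, as the plots above do
labelled <- toy[!is.na(toy$Survived), ]

# The mean of a 0/1 column is the share of survivors in each class
# (for a factor column, convert with as.numeric(as.character(x)) first)
survival_rate <- tapply(labelled$Survived, labelled$Pclass, mean)

print(round(survival_rate, 2))
```

The same pattern applies to any of the categorical features below by swapping the grouping column.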
Name format: {LastName, Title. FirstName MiddleName}
head(titanic$Name)
## [1] "Braund, Mr. Owen Harris"
## [2] "Cumings, Mrs. John Bradley (Florence Briggs Thayer)"
## [3] "Heikkinen, Miss. Laina"
## [4] "Futrelle, Mrs. Jacques Heath (Lily May Peel)"
## [5] "Allen, Mr. William Henry"
## [6] "Moran, Mr. James"
Individual names are unique, but we can extract the Title to help identify patterns based on this variable
Q: What is the frequency of each Title?
titanic$Title <- gsub('(.*, )|(\\..*)', '', titanic$Name)
table(titanic$Sex, titanic$Title)
##
## Capt Col Don Dona Dr Jonkheer Lady Major Master Miss Mlle Mme Mr Mrs
## female 0 0 0 1 1 0 1 0 0 260 2 1 0 197
## male 1 4 1 0 7 1 0 2 61 0 0 0 757 0
##
## Ms Rev Sir the Countess
## female 2 0 0 1
## male 0 8 1 0
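To see what the regular expression does: the first alternative `(.*, )` removes everything up to and including the comma and space after the last name, and the second `(\\..*)` removes the first period and everything after it. A small check on three of the names shown earlier:

```r
# Three names copied from the head(titanic$Name) output above
names_sample <- c(
  "Braund, Mr. Owen Harris",
  "Cumings, Mrs. John Bradley (Florence Briggs Thayer)",
  "Heikkinen, Miss. Laina"
)

# Strip "LastName, " and ". FirstName ..." so only the Title remains
titles <- gsub('(.*, )|(\\..*)', '', names_sample)

print(titles)
```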
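Some Titles appear only once or a handful of times, and ‘Mlle’ and ‘Miss’ refer to the same thing. To avoid overfitting, let’s consolidate these Titles into fewer groups. (Note that ‘Mme’ is the French equivalent of ‘Mrs’; the code below nevertheless groups it with ‘Miss’.)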
titanic <- titanic %>%
mutate(Title = case_when(Title %in% c('Master', 'Miss', 'Mr', 'Mrs') ~ Title,
Title %in% c('Mlle', 'Mme', 'Ms') ~ "Miss",
TRUE ~ "Other"))
table(titanic$Sex, titanic$Title)
##
## Master Miss Mr Mrs Other
## female 0 265 0 197 4
## male 61 0 757 0 25
Q: Survival rate based on Title?
titanic %>%
filter(!is.na(Survived)) %>%
ggplot(aes(x=Title, fill=Survived))+
geom_bar(position='dodge')+
scale_fill_brewer(palette = 'Set1')
Q: A transatlantic crossing sounds rough. Were there more men than women making the trip?
count(titanic, Sex) %>%
mutate(prop = round(n/sum(n)*100)) %>%
arrange(desc(prop))
## # A tibble: 2 x 3
## Sex n prop
## <fct> <int> <dbl>
## 1 male 843 64
## 2 female 466 36
Q: We already suspect that women had a higher survival rate. But holding Pclass constant, was the survival rate still higher for women?
titanic %>%
filter(!is.na(Survived)) %>%
ggplot(aes(x=Sex, fill=Survived)) +
geom_bar(position='dodge') +
scale_fill_brewer(palette = 'Set1') +
facet_wrap(~Pclass, nrow=3, scales = "free_y")
SibSp = siblings and spouses onboard
Parch = parents and children onboard
Both tell us how traveling with family affects survival. Let’s combine them into a new feature: Family (SibSp + Parch + 1, so the passenger is counted too).
Q: Did having other family members impact your survival?
titanic$Family = titanic$SibSp + titanic$Parch + 1
titanic %>%
filter(!is.na(Survived)) %>%
ggplot(aes(x=Family, fill=Survived)) +
geom_bar(position='dodge') +
scale_fill_brewer(palette = 'Set1')
There were 1309 passengers with 929 unique tickets.
Let’s use Ticket to create a variable: N_Per_Ticket
#count how many passengers share each ticket
n_per_ticket <- titanic %>%
group_by(Ticket) %>%
count()
#merge onto the main dataset
titanic <- merge(x=n_per_ticket, y=titanic,
by.x="Ticket", by.y="Ticket",
all.x=TRUE, all.y=TRUE)
colnames(titanic)[colnames(titanic) == "n"] <- "N_Per_Ticket"
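One thing to keep in mind: merge() returns rows sorted by the key (here Ticket) by default, so the row order of titanic changes at this point. If the original order matters, a per-row count can also be computed in place. A sketch on toy data, using base R’s ave() (the toy table is made up):

```r
# Made-up passengers; three of them share ticket "A"
toy <- data.frame(
  PassengerId = 1:5,
  Ticket      = c("A", "B", "A", "C", "A"),
  stringsAsFactors = FALSE
)

# ave() applies FUN within each Ticket group and returns one value per row,
# so summing a column of 1s gives the group size without reordering rows
toy$N_Per_Ticket <- ave(rep(1L, nrow(toy)), toy$Ticket, FUN = sum)

print(toy)
```

Because no merge is involved, PassengerId stays in its original order.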
Q: Are people on the same ticket because they’re from the same family?
titanic %>%
filter(N_Per_Ticket != Family) %>%
count()
## # A tibble: 1 x 1
## n
## <int>
## 1 284
It is not clear what matters more: how big your family is (SibSp + Parch + 1), or how large your travel group is. Let’s create another variable, Groups, as the larger of Family and N_Per_Ticket.
titanic <- titanic %>%
mutate(Groups = ifelse(Family > N_Per_Ticket, Family, N_Per_Ticket))
titanic %>%
filter(!is.na(Survived)) %>%
ggplot(aes(x=Groups, fill=Survived)) +
geom_bar(position='dodge') +
scale_fill_brewer(palette = 'Set1')
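The “larger of two columns” step is an element-wise maximum, which base R provides as pmax(). A small sketch on made-up values, checking it against the ifelse() form used above:

```r
# Made-up values for four passengers
Family       <- c(1, 2, 4, 1)
N_Per_Ticket <- c(1, 1, 4, 3)

# Element-wise maximum of the two vectors
Groups <- pmax(Family, N_Per_Ticket)

# The ifelse() form from the analysis gives the same result
Groups_ifelse <- ifelse(Family > N_Per_Ticket, Family, N_Per_Ticket)

print(Groups)
```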
The distribution is similar to what we saw for Family. Let’s consolidate the values into four buckets in a new variable: Group_Size (1 = alone, 2 = pair, 3 = group of 3-4, 4 = group of 5 or more).
titanic <- titanic %>%
mutate(Group_Size = case_when(Groups == 1 ~ 1,
Groups == 2 ~ 2,
Groups > 2 & Groups < 5 ~ 3,
Groups >= 5 ~ 4))
titanic %>%
filter(!is.na(Survived)) %>%
ggplot(aes(x=Group_Size, fill=Survived)) +
geom_bar(position='dodge') +
scale_fill_brewer(palette = 'Set1')
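The same bucketing can be expressed with cut(), which may be easier to adjust if the boundaries change. A sketch on a few example values; the breaks are my reading of the case_when() rules above:

```r
# Example group sizes, including the boundary values
Groups <- c(1, 2, 3, 4, 5, 11)

# Intervals are right-closed by default: (0,1], (1,2], (2,4], (4,Inf]
# labels = FALSE returns the bucket number instead of a factor
Group_Size <- cut(Groups, breaks = c(0, 1, 2, 4, Inf), labels = FALSE)

print(data.frame(Groups, Group_Size))
```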
Q: Does Fare refer to the price per passenger or per ticket?
titanic %>%
filter(Pclass==3) %>%
ggplot(aes(x=N_Per_Ticket, y=Fare)) +
geom_jitter(color="blue", size=2, alpha=0.3)
## Warning: Removed 1 rows containing missing values (geom_point).
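Fare rises with the number of passengers on the ticket, which suggests it is the price of the whole ticket rather than of one passenger. This causes outliers in Fare. For example, some Fares in Pclass 3 are much higher than typical Fares in Pclass 2.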
titanic %>%
filter(Fare<50) %>%
ggplot(aes(x=Fare, y=Pclass, fill=Pclass))+
geom_boxplot()
We can combine Fare and N_Per_Ticket to create a new feature: Fare_Per_Person.
Let’s check the distribution for Pclass 3 again.
titanic$Fare_Per_Person <- titanic$Fare/titanic$N_Per_Ticket
titanic %>%
filter(Pclass==3) %>%
ggplot(aes(x=N_Per_Ticket, y=Fare_Per_Person)) +
geom_jitter(color="blue", size=2, alpha=0.3)
## Warning: Removed 1 rows containing missing values (geom_point).
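As a worked example of the arithmetic: ticket 113572 appears later in this analysis with a Fare of 80 shared by 2 passengers, which gives 40 per person. The sketch below reproduces that on a toy table (the single-passenger ticket "X1" and its fare are made up):

```r
# Two passengers on ticket 113572 (Fare 80, as listed later in this analysis)
# plus one made-up solo traveler
toy <- data.frame(
  Ticket = c("113572", "113572", "X1"),
  Fare   = c(80, 80, 7.25),
  stringsAsFactors = FALSE
)

# Number of passengers sharing each ticket
toy$N_Per_Ticket <- ave(rep(1L, nrow(toy)), toy$Ticket, FUN = sum)

# Split the ticket price evenly across the passengers on it
toy$Fare_Per_Person <- toy$Fare / toy$N_Per_Ticket

print(toy)
```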
That looks like a more useful variable for us. We can isolate the relationship between how much you paid and your survival rate.
Now here is something interesting: there are people who had 0 Fare. Maybe they got a free ticket. For the purposes of this analysis, we will change their Fare_Per_Person to the median of their Pclass.
q1 <- quantile(filter(titanic, Pclass==1)$Fare_Per_Person)
q2 <- quantile(filter(titanic, Pclass==2)$Fare_Per_Person)
q3 <- quantile(filter(titanic, Pclass==3)$Fare_Per_Person, na.rm = TRUE)
titanic <- titanic %>%
mutate(Fare_Per_Person = case_when(Fare_Per_Person == 0 & Pclass == 1 ~ q1[3],
Fare_Per_Person == 0 & Pclass == 2 ~ q2[3],
Fare_Per_Person == 0 & Pclass == 3 ~ q3[3],
TRUE ~ Fare_Per_Person))
While we’re at it, let’s address our missing value in Fare. One way to deal with missing values is to delete the row. But in this case, the row comes from the test set, and we will have to make a prediction for it.
Looking at this row, we see that this person traveled alone. We can then set his Fare_Per_Person to the median of Pclass 3.
filter(titanic, is.na(Fare))
## Ticket N_Per_Ticket PassengerId Survived Pclass Name Sex Age
## 1 3701 1 1044 <NA> 3 Storey, Mr. Thomas male 60.5
## SibSp Parch Fare Cabin Embarked Title Family Groups Group_Size
## 1 0 0 NA <NA> S Mr 1 1 1
## Fare_Per_Person
## 1 NA
titanic$Fare_Per_Person[is.na(titanic$Fare_Per_Person)] <- q3[3]
There are still outliers.
titanic %>%
ggplot(aes(x=Fare_Per_Person, y=Pclass, fill=Pclass))+
geom_boxplot()
Looking at this dataset, tree ensemble algorithms could be a good fit. They are relatively robust to outliers, so I think it’s best to leave Fare_Per_Person as is. It is already a significant improvement over the original Fare.
Another thing we can do is create Fare_Groups. For algorithms that are sensitive to outliers, this can be a good solution.
Q: Survival rate by Fare_Groups?
* we see a clearer pattern that higher fare means higher survival rate
cuts <- c(-Inf, 7, 8, 10.5, 13, 26, 39, Inf)
labs <- c("0", "1", "2", "3", "4", "5", "6")
titanic <- titanic %>%
mutate(Fare_Groups = cut(Fare_Per_Person, breaks = cuts, labels=labs))
titanic$Fare_Groups <- as.factor(titanic$Fare_Groups)
titanic %>%
filter(!is.na(Survived)) %>%
ggplot(aes(x=Fare_Groups, fill=Survived)) +
geom_bar(position = 'dodge')+
scale_fill_brewer(palette = 'Set1')
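cut() uses right-closed intervals by default, so a Fare_Per_Person of exactly 7 lands in group "0" and 7.25 in group "1". A quick check with the same cuts and labs on a few example fares:

```r
# Same breaks and labels as in the analysis above
cuts <- c(-Inf, 7, 8, 10.5, 13, 26, 39, Inf)
labs <- c("0", "1", "2", "3", "4", "5", "6")

# Example per-person fares, including one boundary value (7)
fares <- c(7, 7.25, 35.64, 7.92, 26.55, 8.05, 40)

# Each fare is assigned to the interval (lower, upper] it falls in
groups <- cut(fares, breaks = cuts, labels = labs)

print(data.frame(fares, groups))
```

Since cut() already returns a factor, the later as.factor() call is harmless but not strictly needed.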
Embarked has 2 missing values:
filter(titanic, is.na(Embarked))
## Ticket N_Per_Ticket PassengerId Survived Pclass
## 1 113572 2 830 1 1
## 2 113572 2 62 1 1
## Name Sex Age SibSp Parch Fare Cabin
## 1 Stone, Mrs. George Nelson (Martha Evelyn) female 62 0 0 80 B28
## 2 Icard, Miss. Amelie female 38 0 0 80 B28
## Embarked Title Family Groups Group_Size Fare_Per_Person Fare_Groups
## 1 <NA> Mrs 1 2 2 40 6
## 2 <NA> Miss 1 2 2 40 6
These 2 share the same Ticket and Cabin. Since they were traveling together, they likely embarked together.
Given that they were in Pclass 1, where are they most likely to have embarked?
* At ‘S’
titanic %>%
filter(Pclass==1 & !is.na(Embarked)) %>%
count(Embarked) %>%
mutate(proportion=n/sum(n))
## # A tibble: 3 x 3
## Embarked n proportion
## <fct> <int> <dbl>
## 1 C 141 0.439
## 2 Q 3 0.00935
## 3 S 177 0.551
We also know that their Fare_Per_Person was 40. This is a fairly high price for Pclass 1. Does where you embark change how much you paid?
* We see that passengers who embarked at ‘C’ were more likely to have paid the higher price. But since the probability of boarding at ‘S’ is higher overall, ‘S’ remains the more likely port.
titanic %>%
filter(Pclass==1& (Embarked=='C' | Embarked=='S')) %>%
ggplot(aes(x=Embarked, y=Fare_Per_Person, fill=Embarked))+
geom_boxplot() +
ylim(20, 60)
## Warning: Removed 16 rows containing non-finite values (stat_boxplot).
#set NA to 'S'
titanic$Embarked[is.na(titanic$Embarked)] <- 'S'
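Picking the most common port can also be done programmatically rather than by reading it off the table. A minimal sketch on a made-up factor (the real analysis conditions on Pclass 1 first):

```r
# Made-up ports of embarkation with one missing value
embarked <- factor(c("S", "C", "S", "Q", "S", "C", NA))

# table() ignores NA by default, so this counts only known ports
counts <- table(embarked)

# Name of the level with the highest count
most_common <- names(counts)[which.max(counts)]

# Fill the missing entries with that level
embarked[is.na(embarked)] <- most_common

print(most_common)
```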
Q: Survival rate by Embarked?
* surprisingly, there does seem to be some kind of relationship
titanic %>%
filter(!is.na(Survived)) %>%
ggplot(aes(x=Embarked, fill=Survived)) +
geom_bar(position = 'dodge')+
scale_fill_brewer(palette = 'Set1')
Cabin format: {letter + numbers}
head(titanic$Cabin)
## [1] B77 B77 B79 E68 E67 E67
## 186 Levels: A10 A11 A14 A16 A18 A19 A20 A21 A23 A24 A26 A29 A31 A32 A34 ... T
There are 1014 missing values of Cabin, or about 77% of the dataset. The best way to handle this may be to throw this variable away. But first, let’s try to figure out exactly what information we’re missing.
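There are two possibilities: (1) passengers without a Cabin value had no cabin, or (2) they had a cabin that was simply not recorded. This is a subtle distinction, so let’s see if we can find some additional information online.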
https://titanic.fandom.com/wiki/Third_Class_cabins
https://titanic.fandom.com/wiki/First_Class_Staterooms
The most important finding: “steerage” (open-space dorms) did not apply to the Titanic. All passengers were housed in cabins. Therefore option 2 was the correct answer.
For those passengers that we have the data for, Cabin provides potentially important information about the Deck the passenger’s accommodation was on. Since lower decks flooded first as the ship sank, this mattered. Let’s extract Deck from Cabin. But keep in mind that this variable is likely to lead to overfitting. We will have to test the models and see.
The usual format is a letter followed by numbers, e.g. ‘A10’. Are there any that do not follow this rule?
* Among cabins starting with ‘F’, a few look different (for example ‘F E57’). The leading ‘F’ is likely extra, so what remains follows our normal format.
titanic %>%
filter((!Pclass==1) & !is.na(Cabin) & substr(Cabin, start=1, stop=2)=='F ') %>%
select(PassengerId, Sex, Age, Pclass, Cabin)
## PassengerId Sex Age Pclass Cabin
## 1 1213 male 25 3 F E57
## 2 1180 male NA 3 F E46
## 3 129 female NA 3 F E69
## 4 700 male 42 3 F G63
## 5 949 male 25 3 F G63
## 6 76 male 25 3 F G73
## 7 716 male 19 3 F G73
titanic <- titanic %>%
mutate(Deck = case_when(is.na(Cabin) ~ "Unknown",
substr(Cabin, start=1, stop=1)=="T" ~ "Unknown",
substr(Cabin, start=1, stop=2)=='F ' ~ substr(Cabin, start=3, stop=3),
TRUE ~ substr(Cabin, start=1, stop=1)
))
#see the resulting variable -
count(titanic, Deck)
## # A tibble: 8 x 2
## Deck n
## <chr> <int>
## 1 A 22
## 2 B 65
## 3 C 94
## 4 D 46
## 5 E 44
## 6 F 14
## 7 G 9
## 8 Unknown 1015
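The Deck rules can be collected into one small helper, which makes them easy to test on individual Cabin strings. This is a sketch; `extract_deck` is a hypothetical name, and it is meant to apply the same rules as the case_when() above:

```r
# Hypothetical helper: map a Cabin string to its Deck letter
extract_deck <- function(cabin) {
  cabin <- as.character(cabin)          # Cabin is stored as a factor above
  first <- substr(cabin, 1, 1)

  # "F E57"-style values: skip the leading "F " and use the next letter
  deck <- ifelse(substr(cabin, 1, 2) == "F ", substr(cabin, 3, 3), first)

  # Missing cabins and the single "T" cabin are treated as Unknown
  deck[is.na(cabin) | first == "T"] <- "Unknown"
  deck
}

# Example values; multi-cabin strings fall back to the first letter
decks <- extract_deck(c("C85", "F E57", "F G63", NA, "T", "C23 C25 C27", "F33"))
print(decks)
```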
Also note most passengers with Deck data are Pclass 1. If this variable is useful, it would be for Pclass 1.
titanic %>%
filter(Deck!='Unknown') %>%
count(Pclass)
## # A tibble: 3 x 2
## Pclass n
## <fct> <int>
## 1 1 255
## 2 2 23
## 3 3 16
Q: Survival rate by Deck (ignoring missing data, Pclass 1 only)?
* Deck A actually had the worst survival rate
titanic %>%
filter(!is.na(Survived) & Deck!='Unknown' & Pclass==1) %>%
ggplot(aes(x=Deck, fill=Survived)) +
geom_bar(position = 'dodge')+
scale_fill_brewer(palette = 'Set1')
There are 263 missing values of Age, or 20% of the dataset. This information is potentially impactful, so we should do our best to replace it. The median age of the dataset is 28.
Our Title variable can perhaps help: Miss vs Mrs and Mr vs Master both have an age component. We can also guess that older people could more often afford a better Pclass. Let’s use these two variables and set each missing Age to the median of its Pclass/Title group.
#lookup table
lookup <- titanic %>%
group_by(Pclass, Title) %>%
summarize(med_age = median(Age, na.rm=TRUE))
#rows with missing ages
missing_ages <- titanic %>% filter(is.na(Age))
#merge the two
temp <- merge(x=missing_ages, y=lookup,
by = c('Pclass', 'Title'),
all.x=TRUE)
temp <- temp %>% select(PassengerId, med_age)
titanic <- merge(x=titanic, y=temp,
by = 'PassengerId',
all.x=TRUE)
#add med_age to original dataset
titanic <- titanic %>%
mutate(med_age = case_when(is.na(med_age) ~ Age,
TRUE ~ med_age))
#confirm that Age and our best guesses share the same median
titanic %>%
group_by(Pclass, Title) %>%
summarize(n =n(),
missing= sum(is.na(Age)),
prop_missing = round(missing/n*100),
age = median(Age, na.rm=TRUE),
new = median(med_age, na.rm=TRUE))
## # A tibble: 14 x 7
## # Groups: Pclass [3]
## Pclass Title n missing prop_missing age new
## <fct> <chr> <int> <int> <dbl> <dbl> <dbl>
## 1 1 Master 5 0 0 6 6
## 2 1 Miss 63 1 2 30 30
## 3 1 Mr 159 27 17 41.5 41.5
## 4 1 Mrs 77 10 13 45 45
## 5 1 Other 19 1 5 48.5 48.5
## 6 2 Master 11 0 0 2 2
## 7 2 Miss 51 2 4 20 20
## 8 2 Mr 150 13 9 30 30
## 9 2 Mrs 55 1 2 30.5 30.5
## 10 2 Other 10 0 0 41.5 41.5
## 11 3 Master 45 8 18 6 6
## 12 3 Miss 151 48 32 18 18
## 13 3 Mr 448 136 30 26 26
## 14 3 Mrs 65 16 25 31 31
#replace Age
titanic$Age <- titanic$med_age
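The lookup-and-merge steps above amount to filling each missing Age with the median of its Pclass/Title group. A more compact sketch of the same idea on made-up data, using base R’s ave() (`key`, `group_median` and `Age_Filled` are names introduced for this illustration):

```r
# Made-up passengers with two missing ages
toy <- data.frame(
  Pclass = c(1, 1, 1, 3, 3, 3),
  Title  = c("Mr", "Mr", "Mr", "Miss", "Miss", "Miss"),
  Age    = c(40, NA, 44, 18, 22, NA),
  stringsAsFactors = FALSE
)

# One grouping key per Pclass/Title combination
key <- paste(toy$Pclass, toy$Title)

# Median of the known ages within each group, repeated for every row
group_median <- ave(toy$Age, key, FUN = function(a) median(a, na.rm = TRUE))

# Keep known ages; use the group median only where Age is missing
toy$Age_Filled <- ifelse(is.na(toy$Age), group_median, toy$Age)

print(toy)
```

This keeps the rows in place, so no merge (and no re-sorting) is needed.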
Given that we replaced 20% of the Age data, and there are still outliers, perhaps bucketing Age could be useful.
titanic <- titanic %>%
mutate(Age_Group = case_when(Age < 16 ~ "0-15",
Age >= 16 & Age < 20 ~ "16-19",
Age >= 20 & Age < 24 ~ "20-23",
Age >= 24 & Age < 30 ~ "24-29",
Age >= 30 & Age < 40 ~ "30-39",
Age >= 40 & Age < 55 ~ "40-54",
Age >= 55 ~ "55+"))
titanic %>% count(Age_Group)
## # A tibble: 7 x 2
## Age_Group n
## <chr> <int>
## 1 0-15 123
## 2 16-19 158
## 3 20-23 138
## 4 24-29 344
## 5 30-39 263
## 6 40-54 216
## 7 55+ 67
titanic$Age_Group <- as.factor(titanic$Age_Group)
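The same age buckets can be produced with cut(). Note right = FALSE, so each interval includes its lower bound, matching the >= / < rules above. A sketch on a few boundary ages:

```r
# Ages chosen to sit on or near the bucket boundaries
ages <- c(15.5, 16, 23.9, 24, 54, 55)

# Left-closed intervals: [-Inf,16), [16,20), [20,24), [24,30), [30,40), [40,55), [55,Inf)
age_group <- cut(
  ages,
  breaks = c(-Inf, 16, 20, 24, 30, 40, 55, Inf),
  labels = c("0-15", "16-19", "20-23", "24-29", "30-39", "40-54", "55+"),
  right  = FALSE
)

print(data.frame(ages, age_group))
```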
Q: Survival rate by Age_Group?
* the group with the best survival rate was children 15 and under
* the worst were people in their 20s
* surprisingly, people 55+ were not really better off, even though one might expect them to have been prioritized in the rescue
titanic %>%
filter(!is.na(Survived)) %>%
ggplot(aes(x = Age_Group, fill=Survived))+
geom_bar(position='dodge')+
scale_fill_brewer(palette = 'Set1')
Did the fate of other people in your group impact your survival? ‘Group’ here means those who were on the same ticket as you. Identifying family members based on last name was difficult, so I have decided to use Ticket for simplicity. If this variable appears important after checking feature importance on the models, we can return to do more work on it.
Who are your companions?
Companions <- titanic %>%
group_by(Ticket) %>%
summarize(n = n(),
Died = sum(Survived==0, na.rm = T),
Lived = sum(Survived==1, na.rm = T),
Known_Fate = Died + Lived) %>%
arrange(desc(n))
Companions
## # A tibble: 929 x 5
## Ticket n Died Lived Known_Fate
## <chr> <int> <int> <int> <int>
## 1 CA. 2343 11 7 0 7
## 2 1601 8 2 5 7
## 3 CA 2144 8 6 0 6
## 4 3101295 7 6 0 6
## 5 347077 7 1 3 4
## 6 347082 7 7 0 7
## 7 PC 17608 7 0 2 2
## 8 S.O.C. 14879 7 5 0 5
## 9 113781 6 2 2 4
## 10 19950 6 2 2 4
## # ... with 919 more rows
For example: Ticket CA. 2343
* there are 11 people who share this ticket
* of those, we know the fate of 7 (the other 4 are in the test set)
* of those 7, all of them died
* could this mean that the rest suffered the same fate?
Let’s go through and add a Fate variable to our dataset. For each passenger in the training set, we should ignore her own fate, since including it would mean using the dependent variable to engineer an independent variable.
There are 3 levels.
* 0.5 = default
* 1 = better fate (other people in your group, when fate is known, more often lived)
* 0 = worse fate (other people in your group, when fate is known, more often died)
Q: Survival rate based on Fate?
* as expected, if other members of your group survived, your chances were higher as well
full <- merge(x=titanic, y=Companions,
by.x="Ticket", by.y="Ticket",
all.x=TRUE, all.y=TRUE)
full$Fate <- NA
trainfate <- full %>%
filter(!is.na(Survived)) %>%
mutate(Known_Fate = Known_Fate-1,
Fate = case_when(Known_Fate==0 ~ 0.5,
n==1 ~ 0.5,
n>1 & Survived==1 ~ (Lived-1)/Known_Fate,
n>1 & Survived==0 ~ Lived/Known_Fate))
testfate <- full %>%
filter(is.na(Survived)) %>%
mutate(Fate = case_when(Known_Fate==0 ~ 0.5,
n==1 ~ 0.5,
n>1 ~ Lived/Known_Fate))
full <- rbind(trainfate, testfate)
full <- full %>% arrange(PassengerId)
full <- full %>%
mutate(Fate = case_when(Fate < 0.5 ~ 0,
Fate > 0.5 ~ 1,
TRUE ~ Fate))
#assumes titanic and full are both ordered by PassengerId at this point
titanic$Fate <- as.factor(full$Fate)
titanic %>%
filter(!is.na(Survived)) %>%
ggplot(aes(x=Fate, fill=Survived))+
geom_bar(position='dodge')+
scale_fill_brewer(palette = 'Set1')
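The leave-one-out logic is easier to follow on a tiny example. The sketch below uses made-up tickets; Survived is NA for test-set passengers, and each passenger’s own outcome is removed before computing the share of companions who lived:

```r
# Made-up ticket groups; NA = test-set passenger with unknown outcome
toy <- data.frame(
  Ticket   = c("A", "A", "A", "B", "B", "C", "D", "D"),
  Survived = c(0,   0,   NA,  1,   1,   1,   0,   1),
  stringsAsFactors = FALSE
)

known     <- !is.na(toy$Survived)
own_lived <- ifelse(known, toy$Survived, 0)

# Per-ticket totals over passengers whose outcome is known
lived_total <- ave(own_lived, toy$Ticket, FUN = sum)
known_total <- ave(as.numeric(known), toy$Ticket, FUN = sum)

# Leave one out: subtract the passenger's own contribution
others_lived <- lived_total - own_lived
others_known <- known_total - as.numeric(known)
rate <- others_lived / others_known   # NaN when no companion outcome is known

# 0.5 = default, 1 = companions mostly lived, 0 = companions mostly died
toy$Fate <- ifelse(others_known == 0, 0.5,
                   ifelse(rate > 0.5, 1, ifelse(rate < 0.5, 0, 0.5)))

print(toy)
```

The passenger on ticket "C" travels alone and gets the default 0.5, while the two passengers on ticket "D" get opposite values because each sees only the other’s outcome.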
We can go through the variables and delete the ones we do not need.
Here is our preprocessed data, ready for the next stage!
titanic$Name <- NULL
titanic$SibSp <- NULL
titanic$Parch <- NULL
titanic$Cabin <- NULL
titanic$N_Per_Ticket <- NULL
titanic$Fare <- NULL
titanic$Family <- NULL
titanic$Ticket <- NULL
titanic$Groups <- NULL
titanic$med_age <- NULL
str(titanic)
## 'data.frame': 1309 obs. of 13 variables:
## $ PassengerId : num 1 2 3 4 5 6 7 8 9 10 ...
## $ Survived : Factor w/ 2 levels "0","1": 1 2 2 2 1 1 1 1 2 2 ...
## $ Pclass : Factor w/ 3 levels "1","2","3": 3 1 3 1 3 3 1 3 3 2 ...
## $ Sex : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...
## $ Age : num 22 38 26 35 35 26 54 2 27 14 ...
## $ Embarked : Factor w/ 3 levels "C","Q","S": 3 1 3 3 3 2 3 3 3 1 ...
## $ Title : chr "Mr" "Mrs" "Miss" "Mrs" ...
## $ Group_Size : num 2 2 1 2 1 1 2 4 3 2 ...
## $ Fare_Per_Person: num 7.25 35.64 7.92 26.55 8.05 ...
## $ Fare_Groups : Factor w/ 7 levels "0","1","2","3",..: 2 6 2 6 3 3 5 1 1 5 ...
## $ Deck : chr "Unknown" "C" "Unknown" "C" ...
## $ Age_Group : Factor w/ 7 levels "0-15","16-19",..: 3 5 4 5 5 4 6 1 4 1 ...
## $ Fate : Factor w/ 3 levels "0","0.5","1": 2 2 2 1 2 2 2 1 3 1 ...
#export file:
ttrain <- titanic[1:891,]
ttest <- titanic[892:1309,]
write.csv(ttrain,"preprocessed_train.csv", row.names = FALSE)
write.csv(ttest,"preprocessed_test.csv", row.names = FALSE)
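The split above relies on the rows being ordered by PassengerId. An alternative sketch that does not depend on row order, assuming test rows are exactly those where Survived is NA (as set when the two files were combined); the toy table is made up:

```r
# Made-up combined table in arbitrary row order
toy <- data.frame(
  PassengerId = c(3, 1, 2, 4),
  Survived    = c(NA, 0, 1, NA)
)

# Training rows have a known outcome; test rows do not
toy_train <- toy[!is.na(toy$Survived), ]
toy_test  <- toy[is.na(toy$Survived), ]

print(toy_train)
print(toy_test)
```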