First we download the data set from the Kaggle website. It consists of two files, train.csv and test.csv.
Now we set the working directory to the project folder.
setwd("C:/Users/la5w7/Desktop/Spring 2018/CS-5565-Intro-to-Statistical-Learning/Project")
Now we can read the CSV file.
titanic.train <-read.csv(file="train.csv",stringsAsFactors = FALSE,header = TRUE)
Here we set stringsAsFactors = FALSE because we want to manipulate the text columns first. By default (in R versions before 4.0), read.csv converts every string column into a factor, that is, a categorical variable. We do not want that yet.
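To see what the stringsAsFactors argument changes, here is a minimal sketch. The inline CSV text and the names csv.text, as.chr, and as.fct are made up for illustration and are not part of the Titanic files.

```r
# Toy CSV text standing in for train.csv (hypothetical rows).
csv.text <- "Name,Sex,Age\nKelly,male,34.5\nWilkes,female,47"

# Read the same text twice, once with each setting.
as.chr <- read.csv(text = csv.text, stringsAsFactors = FALSE)
as.fct <- read.csv(text = csv.text, stringsAsFactors = TRUE)

class(as.chr$Sex)  # strings stay as character vectors
class(as.fct$Sex)  # strings are converted to a factor
```

Keeping the columns as character makes it easier to compare and overwrite values such as empty strings before deciding which columns should become factors.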
Our approach is to combine train and test into one data set and clean them together, rather than cleaning each one separately.
str(titanic.test)
'data.frame': 418 obs. of 11 variables:
$ PassengerId: int 892 893 894 895 896 897 898 899 900 901 ...
$ Pclass : int 3 3 2 3 3 3 3 2 3 3 ...
$ Name : chr "Kelly, Mr. James" "Wilkes, Mrs. James (Ellen Needs)" "Myles, Mr. Thomas Francis" "Wirz, Mr. Albert" ...
$ Sex : chr "male" "female" "male" "male" ...
$ Age : num 34.5 47 62 27 22 14 30 26 18 21 ...
$ SibSp : int 0 1 0 0 1 0 0 1 0 2 ...
$ Parch : int 0 0 0 0 1 0 0 1 0 0 ...
$ Ticket : chr "330911" "363272" "240276" "315154" ...
$ Fare : num 7.83 7 9.69 8.66 12.29 ...
$ Cabin : chr "" "" "" "" ...
$ Embarked : chr "Q" "S" "Q" "S" ...
You can see that the Survived column is missing. The aim of this project is to train a model on the train data set, apply that model to the test data set, and then compare the predictions with the known outcomes to find the accuracy of our model.
Now we will combine both files and clean the data. You could also clean the train and test data separately.
After cleaning, we will separate the data set into train and test again.
First we make sure that both data sets have the same number of columns.
names(titanic.train)
[1] "PassengerId" "Survived" "Pclass" "Name" "Sex"
[6] "Age" "SibSp" "Parch" "Ticket" "Fare"
[11] "Cabin" "Embarked" "IsTrainSet"
ncol(titanic.train)
[1] 13
names(titanic.test)
[1] "PassengerId" "Pclass" "Name" "Sex" "Age"
[6] "SibSp" "Parch" "Ticket" "Fare" "Cabin"
[11] "Embarked" "IsTrainSet"
ncol(titanic.test)
So we add a Survived column to the test set, so that both data sets have the same number of columns with the same names.
Now we combine both data sets.
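The combine step can be sketched on two toy data frames. The data frames train and test below are hypothetical stand-ins with only a few columns; the IsTrainSet flag and the NA-filled Survived column follow the same idea used in this walkthrough.

```r
# Hypothetical miniature versions of the two files.
train <- data.frame(PassengerId = 1:3, Survived = c(0, 1, 1), Pclass = c(3, 1, 3))
test  <- data.frame(PassengerId = 4:5, Pclass = c(2, 3))

# Flag where each row came from so the data can be split again later.
train$IsTrainSet <- TRUE
test$IsTrainSet  <- FALSE

# The test set has no outcome, so add the column filled with NA.
test$Survived <- NA

# rbind matches data frame columns by name, so column order may differ.
full <- rbind(train, test)
nrow(full)
names(full)
```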
Later we will split the data into train and test again. For now we clean the data by filling in the missing values.
table(titanic.full$Embarked)
      C   Q   S 
  2 270 123 914 
So there are three ports: C, Q, and S, where S is also the mode. Two rows have an empty value. Let us first build a filter that finds the empty values and fill them with S.
titanic.full[titanic.full$Embarked=='',"Embarked"]<-'S'
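The choice of S can also be derived rather than typed by hand. The following is a small sketch on a made-up vector; embarked, counts, and embarked.mode are illustrative names, not objects from the project.

```r
# Hypothetical Embarked values with two empty strings.
embarked <- c("S", "C", "", "S", "Q", "S", "")

# Count the non-empty values and take the most frequent one (the mode).
counts <- table(embarked[embarked != ""])
embarked.mode <- names(counts)[which.max(counts)]

# Fill the empty entries with the mode.
embarked[embarked == ""] <- embarked.mode
table(embarked)
```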
Now if we tabulate the Embarked column again, there should be no missing values. Next, look at Age.
table(is.na(titanic.full$Age))
FALSE TRUE
1046 263
We can see that 263 values are missing. We compute the mean of the known ages and replace the missing Age values with it.
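A minimal sketch of this mean fill, using a made-up data frame called ages in place of titanic.full:

```r
# Hypothetical Age column with two missing values.
ages <- data.frame(Age = c(22, NA, 30, NA, 38))

# Mean of the known ages; na.rm = TRUE skips the missing entries.
age.mean <- mean(ages$Age, na.rm = TRUE)

# Replace only the missing entries with the mean.
ages[is.na(ages$Age), "Age"] <- age.mean
table(is.na(ages$Age))
```

Without na.rm = TRUE the mean itself would be NA, and the fill would have no effect.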
Now if we check the Age column again, there are no missing values left.
table(is.na(titanic.full$Age))
FALSE
1309
Another column to check for missing values is Fare.
table(is.na(titanic.full$Fare))
FALSE
1309
Before splitting the data, we convert the categorical columns to factors.
Now we split the data back into the train and test data sets.
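The split relies on the IsTrainSet flag that was added before combining. Here is a sketch on a hypothetical combined data frame named full:

```r
# Hypothetical combined data: three training rows and two test rows.
full <- data.frame(
  PassengerId = 1:5,
  Survived    = c(0, 1, 1, NA, NA),
  IsTrainSet  = c(TRUE, TRUE, TRUE, FALSE, FALSE)
)

# Select rows by the flag to recover the two original sets.
train <- full[full$IsTrainSet == TRUE, ]
test  <- full[full$IsTrainSet == FALSE, ]

nrow(train)
nrow(test)
```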
We also convert the Survived column to a factor.
titanic.train$Survived <- as.factor(titanic.train$Survived)
This makes it a supervised binary classification problem. Let us first fit a random forest model.
Now we fit the model on the train data set.
titanic.model <- randomForest(formula = survived.formula, data = titanic.train, ntree = 500, mtry = 3, nodesize = 0.01 * nrow(titanic.test))
The response has five or fewer unique values. Are you sure you want to do regression?
Now we apply the model to the test data set.
write.csv(output.df, file = "kaggle_submission.csv", row.name = FALSE)
Error in write.table(output.df, file = "kaggle_submission.csv", row.name = FALSE, :
'col.names = NA' makes no sense when 'row.names = FALSE'
Kaggle expects the submission in a specific format with two columns, PassengerId and Survived. The call above fails because the argument name is misspelled: it must be row.names, not row.name.
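As a sketch of the two-column format, the following writes a made-up output.df to a temporary file and reads it back. The three rows and the out.file path are illustrative only.

```r
# Hypothetical predictions for three passengers.
output.df <- data.frame(PassengerId = 892:894, Survived = factor(c(0, 1, 0)))

# Write without row names so the file has exactly two columns.
out.file <- tempfile(fileext = ".csv")
write.csv(output.df, file = out.file, row.names = FALSE)

# Read it back to check the header and the number of rows.
check <- read.csv(out.file)
names(check)
nrow(check)
```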
Next we build predictive models for the missing data. For Fare we fit a regression model.
setwd("C:/Users/la5w7/Desktop/Spring 2018/CS-5565-Intro-to-Statistical-Learning/Project")
titanic.train <-read.csv(file="train.csv",stringsAsFactors = FALSE)
titanic.test <-read.csv(file="test.csv",stringsAsFactors = FALSE)
#str(titanic.test)
titanic.train$IsTrainSet <- TRUE
titanic.test$IsTrainSet <- FALSE
#names(titanic.train)
#ncol(titanic.train)
#names(titanic.test)
#ncol(titanic.test)
titanic.test$Survived <- NA
titanic.full <- rbind(titanic.train, titanic.test)
titanic.full[titanic.full$Embarked=='',"Embarked"]<-'S'
titanic.full$Pclass <- as.factor(titanic.full$Pclass)
titanic.full$Sex <- as.factor(titanic.full$Sex)
titanic.full$Embarked <- as.factor(titanic.full$Embarked)
First let us check how many values are missing in Fare. We start with a boxplot to see how many values fall outside the whiskers.

We can see that many values are outliers. We get the upper whisker from boxplot.stats.
boxplot.stats(titanic.full$Fare)$stats[5]
[1] 65
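To make the meaning of stats[5] concrete, here is a small sketch on a made-up fare vector (the numbers are not taken from the Titanic data). boxplot.stats returns the five boxplot numbers in stats, with the upper whisker in position 5, and the points beyond the whiskers in out.

```r
# Hypothetical fares with two very large values.
fares <- c(7, 8, 8, 10, 13, 15, 26, 30, 65, 263, 512)

stats <- boxplot.stats(fares)

stats$stats  # lower whisker, lower hinge, median, upper hinge, upper whisker
stats$out    # values drawn as individual outlier points

# The fifth element is the upper whisker used as the outlier cutoff.
upper.whisker <- stats$stats[5]
upper.whisker
```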
and make a filter, outlier.filter <-
Now we predict the missing Fare values with the regression model trained from fare.equation.
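The fare model itself is not shown here, so the following is only a sketch of the general idea under stated assumptions: the data frame toy, the single predictor Pclass, and the contents of fare.equation are all hypothetical, and the real model may use different predictors.

```r
# Hypothetical data: one extreme fare (300) and two missing fares.
toy <- data.frame(
  Pclass = factor(c(1, 1, 2, 2, 3, 3, 3, 1, 3, 2)),
  Fare   = c(60, 55, 20, 22, 8, 7, 9, 300, NA, NA)
)

# Fit only on rows with a known fare below the upper whisker.
upper.whisker  <- boxplot.stats(toy$Fare)$stats[5]
outlier.filter <- !is.na(toy$Fare) & toy$Fare < upper.whisker

# Assumed formula; the original fare.equation is not given.
fare.equation <- "Fare ~ Pclass"
fare.model <- lm(formula = as.formula(fare.equation), data = toy[outlier.filter, ])

# Predict for the rows with a missing fare and fill them in.
fare.row <- toy[is.na(toy$Fare), "Pclass", drop = FALSE]
toy[is.na(toy$Fare), "Fare"] <- predict(fare.model, newdata = fare.row)
toy$Fare
```

Building the filter with !is.na(...) keeps the missing rows out of the fit, so they are still available to be predicted afterwards.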
titanic.full[is.na(titanic.full$Fare),"Fare"]
numeric(0)
Now we build a regression model to predict Age.
boxplot(titanic.full$Age)

upper.whisker <- boxplot.stats(titanic.full$Age)$stats[5]
# Fit only on rows with a known Age below the upper whisker.
# The missing ages are left as NA here so that the model can predict them.
outlier.filter <- !is.na(titanic.full$Age) & titanic.full$Age < upper.whisker
age.equation = "Age ~ Pclass + Sex + SibSp + Parch + Fare + Embarked"
age.model <- lm(formula = age.equation, data = titanic.full[outlier.filter,])
# Predictor columns for the rows whose Age is missing.
age.row <- titanic.full[is.na(titanic.full$Age),c("Pclass", "Sex" ,"SibSp","Parch","Fare","Embarked")]
age.predictions <-predict(age.model,newdata = age.row)
# Fill the missing ages with the model's predictions.
titanic.full[is.na(titanic.full$Age),"Age"] <- age.predictions
titanic.train <-titanic.full[titanic.full$IsTrainSet == TRUE, ]
titanic.test <-titanic.full[titanic.full$IsTrainSet == FALSE, ]
titanic.train$Survived <- as.factor(titanic.train$Survived)
survived.equation <- "Survived ~ Pclass + Sex + Age + SibSp + Parch + Fare + Embarked "
survived.formula <- as.formula(survived.equation)
#install.packages("randomForest")
library(randomForest)
package 'randomForest' was built under R version 3.4.4
randomForest 4.6-14
Type rfNews() to see new features/changes/bug fixes.
titanic.model <- randomForest(formula = survived.formula, data = titanic.train, ntree = 500, mtry = 3, nodesize = 0.01 * nrow(titanic.test))
features.equation <- "Pclass + Sex + Age + SibSp + Parch +Fare +Embarked"
Survived <- predict(titanic.model, newdata = titanic.test)
PassengerId <- titanic.test$PassengerId
output.df <- as.data.frame(PassengerId)
output.df$Survived <- Survived
write.csv(output.df, file = "kaggle_submission1.csv", row.names = FALSE)
