First, we download the dataset from the Kaggle website. It consists of two files: train.csv and test.csv.

Now we set the working directory to the project folder.

setwd("C:/Users/la5w7/Desktop/Spring 2018/CS-5565-Intro-to-Statistical-Learning/Project")

Now we can read the CSV file.

titanic.train <-read.csv(file="train.csv",stringsAsFactors = FALSE,header = TRUE)

Here we use stringsAsFactors = FALSE because we want to manipulate the dataset first. By default (in R versions before 4.0), read.csv converts every string column of the data frame into a factor, which we do not want at this stage.
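To see what the option changes, here is a small sketch that reads a made-up two-row CSV from a string instead of the Kaggle file. The names csv.text, as.chr and as.fct are illustrative only.

```r
# Made-up two-row CSV standing in for train.csv
csv.text <- "Name,Sex,Age\nKelly,male,34.5\nWilkes,female,47"

# Same data read twice, once with each setting
as.chr <- read.csv(text = csv.text, stringsAsFactors = FALSE)
as.fct <- read.csv(text = csv.text, stringsAsFactors = TRUE)

class(as.chr$Sex)  # character column, free to edit as plain text
class(as.fct$Sex)  # factor column, restricted to its existing levels
```

Keeping strings as characters first and converting selected columns to factors later gives us control over which columns become categories.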

Our approach is to combine both datasets into one data frame and clean them together, rather than cleaning train and test separately.

str(titanic.test)
'data.frame':   418 obs. of  11 variables:
 $ PassengerId: int  892 893 894 895 896 897 898 899 900 901 ...
 $ Pclass     : int  3 3 2 3 3 3 3 2 3 3 ...
 $ Name       : chr  "Kelly, Mr. James" "Wilkes, Mrs. James (Ellen Needs)" "Myles, Mr. Thomas Francis" "Wirz, Mr. Albert" ...
 $ Sex        : chr  "male" "female" "male" "male" ...
 $ Age        : num  34.5 47 62 27 22 14 30 26 18 21 ...
 $ SibSp      : int  0 1 0 0 1 0 0 1 0 2 ...
 $ Parch      : int  0 0 0 0 1 0 0 1 0 0 ...
 $ Ticket     : chr  "330911" "363272" "240276" "315154" ...
 $ Fare       : num  7.83 7 9.69 8.66 12.29 ...
 $ Cabin      : chr  "" "" "" "" ...
 $ Embarked   : chr  "Q" "S" "Q" "S" ...

You can see that the Survived column is missing. The aim of this project is to train a model on the train dataset, apply that model to the test dataset, and then check its predictions to find the accuracy of our model.

Now we will combine both files and clean the data. You could also clean the train and test data separately.

After cleaning the data, we will separate the dataset into train and test again.

Now we will make sure that the number of columns in both datasets is equal.

names(titanic.train)
 [1] "PassengerId" "Survived"    "Pclass"      "Name"        "Sex"        
 [6] "Age"         "SibSp"       "Parch"       "Ticket"      "Fare"       
[11] "Cabin"       "Embarked"    "IsTrainSet" 
ncol(titanic.train)
[1] 13
names(titanic.test)
 [1] "PassengerId" "Pclass"      "Name"        "Sex"         "Age"        
 [6] "SibSp"       "Parch"       "Ticket"      "Fare"        "Cabin"      
[11] "Embarked"    "IsTrainSet" 
ncol(titanic.test)
[1] 12

So we will add a Survived column to the test data so that the number and names of the columns in test and train are the same.

Now we will combine both datasets.
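The flag-and-combine step can be sketched on two tiny made-up data frames. The names train, test and full are stand-ins for titanic.train, titanic.test and titanic.full, and the values are invented.

```r
# Toy stand-ins for the two Kaggle files (values are made up)
train <- data.frame(PassengerId = 1:3, Survived = c(0L, 1L, 1L), Age = c(22, 38, NA))
test  <- data.frame(PassengerId = 4:5, Age = c(34.5, NA))

train$IsTrainSet <- TRUE   # flag rows so they can be separated again later
test$IsTrainSet  <- FALSE
test$Survived    <- NA     # placeholder so both frames have the same columns

# rbind on data frames matches columns by name, so column order may differ
full <- rbind(train, test)
dim(full)
```

The IsTrainSet flag is what makes the later split back into train and test possible.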

Later we will split the data into train and test again. Now we will clean the data by filling in the missing values.

table(titanic.full$Embarked)

      C   Q   S 
  2 270 123 914 

So there are three values: C, Q and S, where S is also the mode. Two rows have an empty value. Let us first make a filter that finds the empty values and fill them with S.

titanic.full[titanic.full$Embarked=='',"Embarked"]<-'S'
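Instead of hard-coding S, the mode could also be computed from the data. The following is a minimal sketch on a made-up vector; embarked, counts and mode.value are illustrative names.

```r
# Made-up port codes with two empty entries
embarked <- c("S", "C", "", "S", "Q", "S", "")

# Count the non-empty values and pick the most frequent one
counts <- table(embarked[embarked != ""])
mode.value <- names(counts)[which.max(counts)]

# Replace the empty entries with the mode
embarked[embarked == ""] <- mode.value
table(embarked)
```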

If we tabulate the Embarked column again, there should be no empty values left. Now let us look at Age.

table(is.na(titanic.full$Age))

FALSE  TRUE 
 1046   263 

We can see that 263 values are missing. We will compute the mean of the known ages and replace the missing Age values with it.
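As a minimal sketch of this mean imputation, here is the same idea on a made-up data frame called toy (an illustrative name, not part of the project code):

```r
# Made-up ages with two missing values
toy <- data.frame(Age = c(22, 38, NA, 35, NA))

# Mean of the known ages; na.rm = TRUE skips the missing ones
age.mean <- mean(toy$Age, na.rm = TRUE)

# Fill only the rows where Age is missing
toy[is.na(toy$Age), "Age"] <- age.mean
table(is.na(toy$Age))
```

Without na.rm = TRUE the mean itself would be NA, and nothing would be filled in.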

If we now check the Age column again, there are no missing values left.

table(is.na(titanic.full$Age))

FALSE 
 1309 

There is another column with missing values: the Fare column.

table(is.na(titanic.full$Fare))

FALSE 
 1309 

Before splitting the data, we will convert the categorical columns to factors.
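Converting before the split means train and test end up with the same factor levels. A small sketch on a made-up data frame (toy is an illustrative name):

```r
# Made-up rows with one integer-coded and one string-coded category
toy <- data.frame(
  Pclass = c(3L, 1L, 2L, 3L),
  Sex    = c("male", "female", "female", "male"),
  stringsAsFactors = FALSE
)

# Treat both columns as categories rather than numbers or free text
toy$Pclass <- as.factor(toy$Pclass)
toy$Sex    <- as.factor(toy$Sex)

levels(toy$Pclass)
str(toy)
```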

Now we will split the data back into the train and test datasets.
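The split relies on the IsTrainSet flag added earlier. A minimal sketch on a made-up combined data frame, where full, train and test are illustrative stand-ins:

```r
# Made-up combined data: three train rows, two test rows
full <- data.frame(
  PassengerId = 1:5,
  Survived    = c(0L, 1L, 1L, NA, NA),
  IsTrainSet  = c(TRUE, TRUE, TRUE, FALSE, FALSE)
)

# Use the logical flag directly as a row filter
train <- full[full$IsTrainSet, ]
test  <- full[!full$IsTrainSet, ]

nrow(train)
nrow(test)
```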

Now we will also convert the Survived column to a factor.

titanic.train$Survived <- as.factor(titanic.train$Survived)

Since Survived takes only two values, this is a supervised binary classification problem. Let us first build a random forest model.

Now we will fit the model on the train dataset.

titanic.model <- randomForest(formula = survived.formula, data = titanic.train, ntree = 500, mtry = 3, nodesize = 0.01 * nrow(titanic.test))
The response has five or fewer unique values.  Are you sure you want to do regression?

Now we will apply the model to the test dataset.

write.csv(output.df, file = "kaggle_submission.csv", row.names = FALSE)

Kaggle requires the submission in a particular format with two columns: PassengerId and Survived.
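The two-column layout can be sketched with made-up predictions written to a temporary file. The IDs, the predicted values and out.file are illustrative only.

```r
# Made-up IDs and predictions standing in for the model output
PassengerId <- 892:894
Survived    <- factor(c(0, 1, 0))

output.df <- data.frame(PassengerId = PassengerId, Survived = Survived)

# row.names = FALSE keeps R's row index out of the file
out.file <- tempfile(fileext = ".csv")
write.csv(output.df, file = out.file, row.names = FALSE)

# Read the file back to check the layout
back <- read.csv(out.file)
names(back)
```

Note that the argument is spelled row.names, with an s.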

Now we will build predictive models for the missing data. For Fare, we will fit a regression model.

setwd("C:/Users/la5w7/Desktop/Spring 2018/CS-5565-Intro-to-Statistical-Learning/Project")
titanic.train <-read.csv(file="train.csv",stringsAsFactors = FALSE)
titanic.test <-read.csv(file="test.csv",stringsAsFactors = FALSE)
#str(titanic.test)
titanic.train$IsTrainSet <- TRUE
titanic.test$IsTrainSet <- FALSE
#names(titanic.train)
#ncol(titanic.train)
#names(titanic.test)
#ncol(titanic.test)
titanic.test$Survived <- NA
titanic.full <- rbind(titanic.train, titanic.test)
titanic.full[titanic.full$Embarked=='',"Embarked"]<-'S'
titanic.full$Pclass <- as.factor(titanic.full$Pclass)
titanic.full$Sex <- as.factor(titanic.full$Sex)
titanic.full$Embarked <- as.factor(titanic.full$Embarked)

First, let us check how many values are missing in Fare. Let us also draw a boxplot and see how many values fall outside the whiskers.

We can see that many values are outliers. We will get the upper whisker from boxplot.stats.

boxplot.stats(titanic.full$Fare)$stats[5]
[1] 65

and make a filter as outlier.filter <-

Now we will predict the missing Fare values with the model trained from fare.equation.
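Since the Fare steps are only described here, the following is a sketch of how they might look on a made-up data frame. The toy data, the single predictor Pclass, and the use of <= in the filter are assumptions, not the project's actual fare.equation or filter.

```r
# Made-up fares: one missing value and one extreme outlier (500)
toy <- data.frame(
  Pclass = factor(c(1, 1, 2, 2, 3, 3, 3, 1)),
  Fare   = c(80, 70, 26, 24, 8, 7, NA, 500)
)

# Upper whisker of the boxplot; larger fares count as outliers
upper.whisker <- boxplot.stats(toy$Fare)$stats[5]

# Train only on known, non-outlier fares
outlier.filter <- !is.na(toy$Fare) & toy$Fare <= upper.whisker

# Hypothetical equation with a single predictor
fare.equation <- "Fare ~ Pclass"
fare.model <- lm(formula = as.formula(fare.equation), data = toy[outlier.filter, ])

# Predict Fare for the rows where it is missing and fill them in
fare.missing <- is.na(toy$Fare)
fare.row <- toy[fare.missing, "Pclass", drop = FALSE]
toy[fare.missing, "Fare"] <- predict(fare.model, newdata = fare.row)

toy[is.na(toy$Fare), "Fare"]
```

Excluding outliers from the training rows keeps a few very expensive tickets from pulling the predictions upward.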

titanic.full[is.na(titanic.full$Fare),"Fare"]
numeric(0)

Now we will build a regression model for Age prediction.

boxplot(titanic.full$Age)

# Upper whisker of the Age boxplot; ages above it are treated as outliers
upper.whisker <- boxplot.stats(titanic.full$Age)$stats[5]
# Train only on rows with a known, non-outlier Age
outlier.filter <- !is.na(titanic.full$Age) & titanic.full$Age < upper.whisker
titanic.full[outlier.filter,]
age.equation = "Age ~ Pclass + Sex + SibSp + Parch + Fare + Embarked"
age.model <- lm(formula = age.equation, data = titanic.full[outlier.filter,])
# Do not mean-fill Age beforehand, otherwise no missing rows are left to predict
age.missing <- is.na(titanic.full$Age)
age.row <- titanic.full[age.missing, c("Pclass", "Sex", "SibSp", "Parch", "Fare", "Embarked")]
age.predictions <- predict(age.model, newdata = age.row)
# Write the predictions into Age (not Fare)
titanic.full[age.missing, "Age"] <- age.predictions
titanic.train <-titanic.full[titanic.full$IsTrainSet == TRUE, ]
titanic.test <-titanic.full[titanic.full$IsTrainSet == FALSE, ]
titanic.train$Survived <- as.factor(titanic.train$Survived)
survived.equation <- "Survived ~ Pclass + Sex + Age + SibSp + Parch + Fare + Embarked "
survived.formula <- as.formula(survived.equation)
#install.packages("randomForest")
library(randomForest)
package 'randomForest' was built under R version 3.4.4
randomForest 4.6-14
Type rfNews() to see new features/changes/bug fixes.
titanic.model <- randomForest(formula = survived.formula, data = titanic.train, ntree = 500, mtry = 3, nodesize = 0.01 * nrow(titanic.test))
features.equation <- "Pclass + Sex + Age + SibSp + Parch +Fare +Embarked"
Survived <- predict(titanic.model, newdata = titanic.test)
PassengerId <- titanic.test$PassengerId
output.df <- as.data.frame(PassengerId)
output.df$Survived <- Survived
write.csv(output.df, file = "kaggle_submission1.csv", row.names = FALSE)
