Week7 BC Assign

#The Titanic training set consists of 891 observations and 12 columns (14 once the two engineered features below, AgeF and Title, are added). The goal is to predict whether a passenger survived given their attributes, which include Sex, Age and Class.

#First, load the packages and read in the data


library(tidyverse)

## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --

## v ggplot2 3.3.5     v purrr   0.3.4
## v tibble  3.1.6     v dplyr   1.0.8
## v tidyr   1.2.0     v stringr 1.4.0
## v readr   2.1.2     v forcats 0.5.1

## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

library(caret)

## Loading required package: lattice

## 
## Attaching package: 'caret'

## The following object is masked from 'package:purrr':
## 
##     lift

library(caTools)
library(ggplot2)

#Make sure the file paths below point to where your CSV files are saved.

titanic = read.csv("C:/Users/polhe/Downloads/titanic (1)/train.csv") 
titanic_testing = read.csv("C:/Users/polhe/Downloads/titanic (1)/test.csv")


titanic_testing$Survived=NA #This is what we want to predict eventually
titanic_testing=titanic_testing[ ,c(12,1:11)] #Move Survived to the front; rbind() below matches data frame columns by name

titanic_full=rbind(titanic,titanic_testing) # Create a full data set to use for imputing missing values and/or feature engineering.

Let us take a quick look at the training dataset to see what we are working with. The majority of the training data consists of men, although there are many women as well.

## [1] "There are: 577 Men"

## [1] "There are: 314 Females"

The deceased are mostly men, and the majority of survivors are women. “Sex” will be an important predictor.
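The code behind these counts is not shown in the report. A minimal sketch of how such a breakdown could be produced, using a small made-up data frame (`toy` and its values are hypothetical, not rows from the Kaggle data):

```r
# Hypothetical toy data standing in for the training set
toy <- data.frame(
  Sex      = c("male", "male", "male", "female", "female", "male"),
  Survived = c(0, 0, 1, 1, 1, 0)
)

# Count passengers by sex
sex_counts <- table(toy$Sex)
print(paste("There are:", sex_counts[["male"]], "Men"))

# Cross-tabulate sex against survival (rows: Sex, columns: Survived)
sex_by_survival <- table(Sex = toy$Sex, Survived = toy$Survived)
print(sex_by_survival)
```

On the real data, the same two calls would be applied to `titanic$Sex` and `titanic$Survived`.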

## [1] 216

## [1] 184

## [1] 491

Even though first class made up less than a quarter of the passengers (216 of 891), it is disproportionately represented among the survivors. Class is an important predictor!

Let us also take a look at class WITH sex.

Notice that almost no first class women died!
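One way to look at class together with sex is to compute the survival rate within each class-by-sex group. The sketch below does this with `aggregate()` on a made-up data frame (`toy` is hypothetical and its values are not taken from the Titanic data):

```r
# Hypothetical toy data: passenger class, sex, and a 0/1 survival flag
toy <- data.frame(
  Pclass   = c(1, 1, 1, 2, 3, 3, 3, 3),
  Sex      = c("female", "female", "male", "female",
               "male", "male", "female", "male"),
  Survived = c(1, 1, 0, 1, 0, 0, 0, 1)
)

# The mean of a 0/1 outcome is the survival rate within each Pclass x Sex group
rates <- aggregate(Survived ~ Pclass + Sex, data = toy, FUN = mean)
print(rates)
```

Note that `Survived` must still be numeric here; once it is converted to a factor for modeling, `mean()` no longer applies.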

Only one value of “Fare” is missing, so we can simply impute it with the mean Fare for this person's class (third class).

titanic_full[1044,'Fare']=mean(titanic_full[titanic_full$Pclass==3,]$Fare,na.rm=T)  #Impute the one missing value of Fare based on their class.
titanic_testing[153,'Fare']=mean(titanic_full[titanic_full$Pclass==3,]$Fare,na.rm=T)
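The two lines above rely on hard-coded row numbers. A possible alternative, sketched here on a made-up data frame (`toy` is hypothetical), is to locate the missing rows with `is.na()` and look up the class mean for each of them:

```r
# Hypothetical toy data with one missing fare in third class
toy <- data.frame(
  Pclass = c(1, 3, 3, 3, 2),
  Fare   = c(80, 8, NA, 10, 20)
)

# Rows where Fare is missing (avoids hard-coding a row number)
missing_rows <- which(is.na(toy$Fare))

# Mean fare per class, ignoring missing values; names are "1", "2", "3"
class_means <- tapply(toy$Fare, toy$Pclass, mean, na.rm = TRUE)

# Replace each missing fare with the mean fare of that passenger's class
toy$Fare[missing_rows] <- class_means[as.character(toy$Pclass[missing_rows])]
print(toy)
```

This version also keeps working if the row order changes or if more than one fare is missing.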

Many values of “Age” are missing. We will treat these missing values as their own factor level and create a new column, AgeF, to represent a range of ages.

# Let us turn age into a broad factor as well. We will consider NA a possible category that may convey some information. The alternative to this method would be to try to predict the missing age values using other predictors.


titanic_full$AgeF=cut(titanic_full$Age,seq(0,max(titanic_full$Age,na.rm=T),5)) #We will consider age as a factor split into 5-year groups:
#(0,5], (5,10], ..., plus an NA category that represents missing age.


titanic_full$AgeF=addNA(titanic_full$AgeF) # Important! Need to add NA as an actual factor level.


titanic$AgeF=titanic_full$AgeF[1:nrow(titanic)]

titanic_testing$AgeF=titanic_full$AgeF[892:1309]
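To make the binning step concrete, here is a small sketch of `cut()` followed by `addNA()` on a made-up age vector (the values and the fixed upper break of 80 are assumptions for illustration):

```r
# Hypothetical ages, including one missing value
ages <- c(2, 5, 7, 31, NA, 80)

# Breaks every 5 years; cut() builds right-closed intervals such as (0,5]
breaks <- seq(0, 80, 5)
age_f <- cut(ages, breaks)

# Without addNA(), the missing age is an NA value, not a factor level
levels_before <- nlevels(age_f)

# addNA() appends NA as an explicit level so a model can treat it as a category
age_f <- addNA(age_f)
print(table(age_f))
```

One caveat of building the breaks with `seq(0, max, 5)`: if the maximum age were not a multiple of 5, the last break would fall below it, and the oldest passengers would land in the NA level together with the truly missing ages.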

We engineer a new feature, “Title”, based on information in the “Name” column, since several titles appear and clearly convey more information than a full name can at this level of analysis.

# Let us make a new column, a factor variable denoting the title of the passenger. Titles include: Mr, Mrs, Miss, Master, Rev, Dr, Lady, Countess, Dona, Capt, Jonkheer, Major and Col. These are obtained from the "Name" column and will likely convey more information than simply a name.


#Make new column, Title
titanic_full$Title=NA


titanic_full$Title <- gsub('(.*, )|(\\..*)','',titanic_full$Name) #This trick is from Huijun Zhao: it strips everything except the title. It works because every title is preceded by ", " and followed by ".", which the two alternatives in the regular expression match. It could also be done manually using grep.



titanic_full$Title[titanic_full$Title %in% c('Miss','Ms','Mlle')]<-"Miss"
titanic_full$Title[titanic_full$Title %in% c("Capt","Dr","Rev","Jonkheer","Major","Col","Sir","Don")]<-"Special men"
titanic_full$Title[titanic_full$Title %in% c('Master')]<-'Master'
titanic_full$Title[titanic_full$Title %in% c("Mr")]<-"Mr"
titanic_full$Title[titanic_full$Title %in% c("Dona","Lady","the Countess")]<-"Special women"
titanic_full$Title[titanic_full$Title %in% c("Mrs","Mme")]<-"Mrs"
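The extraction and grouping can be checked on a few strings in the same "Surname, Title. Given names" layout. The names below are invented for illustration and are not taken from the data set:

```r
# Hypothetical names in the same layout as the Name column
name <- c("Doe, Mr. John",
          "Roe, Mlle. Anne",
          "Poe, the Countess. of Nowhere (Jane)")

# Remove everything up to ", " and everything from the first "." onward
title <- gsub("(.*, )|(\\..*)", "", name)
print(title)  # "Mr" "Mlle" "the Countess"

# Collapse rare titles into broader groups, as in the report
title[title %in% c("Miss", "Ms", "Mlle")] <- "Miss"
title[title %in% c("Dona", "Lady", "the Countess")] <- "Special women"
print(title)
```

Because `.*` is greedy, the first alternative removes text up to the last ", " in the string, so this approach assumes the only comma-space in a name is the one right before the title.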



#Turn Title into factor

titanic_full$Title=as.factor(titanic_full$Title)

titanic$Title=titanic_full$Title[1:nrow(titanic)] 

titanic_testing$Title=titanic_full$Title[892:1309]

Modeling

titanic$Survived=as.factor(titanic$Survived) #Make sure survived is a factor not a numeric.


titanic.n=titanic %>% #For modeling we remove columns which don't seem useful. Age we have as a factor, we used title not name, etc.
select(-c(PassengerId,Name,Ticket,Cabin,Age,Embarked))


fitControl <- trainControl(method = "cv", #5-fold cross-validation to estimate out-of-sample accuracy
                       number = 5)



model_titanic_logistic=caret::train(Survived~.,data=titanic.n,method='glm',family='binomial',trControl=fitControl) #Works best for kaggle as of now
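To show roughly what 5-fold cross-validation measures, the sketch below runs it by hand for a logistic regression in base R. It is not caret's internal implementation, and the simulated data (`toy`, `x`, `y`) are assumptions made up for illustration:

```r
set.seed(1)

# Simulated data: a binary outcome whose probability depends on x
n <- 200
x <- rnorm(n)
y <- factor(rbinom(n, 1, plogis(1.5 * x)))
toy <- data.frame(x = x, y = y)

# Randomly assign each row to one of k folds
k <- 5
fold <- sample(rep(1:k, length.out = n))

acc <- numeric(k)
for (i in 1:k) {
  held_out <- fold == i
  # Fit on the other k - 1 folds
  fit <- glm(y ~ x, data = toy[!held_out, ], family = binomial)
  # Predict the probability of class "1" on the held-out fold
  p <- predict(fit, newdata = toy[held_out, ], type = "response")
  pred <- ifelse(p > 0.5, "1", "0")
  # Accuracy on rows the model did not see during fitting
  acc[i] <- mean(pred == as.character(toy$y[held_out]))
}

# The average over folds is the cross-validated accuracy estimate
print(mean(acc))
```

With caret, the corresponding resampled estimate is stored in the fitted object's `results` element, while the final model is refit on all of the training data.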



logistic_pred=predict(model_titanic_logistic,titanic_testing)


#Prepare to submit for kaggle
pid=titanic_testing$PassengerId # Get a vector of passenger Ids for the testing set
df.sub=data.frame(pid,logistic_pred)
names(df.sub)=c("PassengerId","Survived")
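The report stops at building the data frame and does not show the export step. A minimal sketch of writing it to disk is below; the stand-in `df.sub` values and the file name `submission.csv` are hypothetical:

```r
# Hypothetical stand-in for the submission data frame built above
df.sub <- data.frame(
  PassengerId = c(892, 893, 894),
  Survived    = factor(c(0, 1, 0))
)

# Write to a temporary directory; row.names = FALSE keeps only the two columns
out_file <- file.path(tempdir(), "submission.csv")
write.csv(df.sub, out_file, row.names = FALSE)

# Read the file back to confirm its shape
check <- read.csv(out_file)
print(check)
```

Leaving `row.names` at its default would add an extra unnamed column of row numbers to the file.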

Accuracy of 0.79425 once we submit to Kaggle! Pretty good. In the end, logistic regression provided better predictions than GBM, KNN or random forests.


Hunter P

3/3/2022