The objectives for this first lab are quite modest.
The intention is to get you used to these processes ahead of your first project, with the Kaggle part being important for the group project - which has a competitive component administered on the Kaggle platform. Also, if you’re new to R or Python, you’ll get to flex these a little.
For this lab, I suggest James et al., chapter 8 (tree-based methods). You can use whatever analysis package you like, but it will most likely be R or Python, and there is ample material around for both. I suggest Hadley Wickham’s R for Data Science for a lot of good basics in R.
We are going to be working with the Titanic dataset on Kaggle. This has become a cliché starting-point for data analysis in data science. There are lots of materials surrounding it, including a rash of user-provided tutorials of very variable quality. You can access these here:
For more general introductions to R and Python, there are free introductory courses, which might be to your taste (personally I hate videos):
I’m not endorsing these particularly, but they appear OK - there are many others. Naturally they’d like you to pay for more advanced things.
Note, this document has been produced in an R notebook in RStudio (basically R Markdown with some extra features). A similar thing in Python is the Jupyter Notebook. I recommend that you explore these sorts of things, as they can make your analysis life easier, as well as collaboration. In short, they merge the coding and analysis with report writing (or webpages, presentations, etc.).
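To give a flavour, here is a minimal R Markdown skeleton (an illustrative sketch, not this document’s actual source - knitting it produces an HTML report with the chunk’s output embedded in the prose):

---
title: "My first analysis"
output: html_document
---

Some narrative written in plain *markdown*.

```{r summary-chunk}
# an R code chunk: it runs when the document is knitted
summary(cars)  # cars is a small dataset built into R
```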
Tasks:
Your group project will be administered through the Kaggle platform and there will be a leader-board. So today’s process has this in mind.
The following does some of this in R. There is nothing fancy going on here, so feel free to explore beyond this. I have folded the code in the HTML by default, so if you want to see it, click the code button.
We’ll just do some simple summaries and a plot, then get to modelling.
#= load in some useful packages
library(ggplot2)   # for pretty graphs
library(tidyverse) # lots of useful data manipulation tools (includes dplyr and tidyr)
library(rattle)    # some data mining tools (fancyRpartPlot lives here)
#library(rpart.plot)   # for specialised plots - used by fancyRpartPlot, so install it even if not attached
#library(RColorBrewer) # colour palettes, likewise used by fancyRpartPlot
library(rpart)     # the basic tree modelling package (ugly default plots)
To read in data, you need to know what format it is in. The file suffix is usually a clue, but may be wrong or ambiguous. Here our data is suffixed “csv” (Comma Separated Values), which suggests the values are, surprise, separated by commas. These ought to be human-readable, so you can confirm by opening the file in a text editor (NB some European locales use “;” as the separator, because “,” is their decimal point).
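If you do meet one of those semicolon-separated files, base R has a convenience reader for exactly that convention (the file name below is made up for illustration):

#= the semicolon convention: read.csv2 assumes sep=";" and dec=","
# equivalent to read.csv("...", sep=";", dec=",")
euroData <- read.csv2("data/some_european_file.csv", header=TRUE)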
Let’s get it in and summarise a bit:
#= read in the data - you need to know where it is, and what it's called
# it's a csv and the first row contains column headings
leoIsDead <- read.csv("data/train.csv", header=TRUE)
# load the test dataset - we'll need this for predictions for kaggle later
testData <- read.csv("data/test.csv", header=TRUE)
#= simple summaries
# top and bottom
head(leoIsDead); tail(leoIsDead)
# basic summaries
summary(leoIsDead)
survived pclass name sex age sibsp parch ticket
Min. :0.0000 Min. :1.000 Abbing, Mr. Anthony : 1 female:314 Min. : 0.42 Min. :0.000 Min. :0.0000 1601 : 7
1st Qu.:0.0000 1st Qu.:2.000 Abbott, Mr. Rossmore Edward : 1 male :577 1st Qu.:20.12 1st Qu.:0.000 1st Qu.:0.0000 347082 : 7
Median :0.0000 Median :3.000 Abbott, Mrs. Stanton (Rosa Hunt) : 1 Median :28.00 Median :0.000 Median :0.0000 CA. 2343: 7
Mean :0.3838 Mean :2.309 Abelson, Mr. Samuel : 1 Mean :29.70 Mean :0.523 Mean :0.3816 3101295 : 6
3rd Qu.:1.0000 3rd Qu.:3.000 Abelson, Mrs. Samuel (Hannah Wizosky): 1 3rd Qu.:38.00 3rd Qu.:1.000 3rd Qu.:0.0000 347088 : 6
Max. :1.0000 Max. :3.000 Adahl, Mr. Mauritz Nils Martin : 1 Max. :80.00 Max. :8.000 Max. :6.0000 CA 2144 : 6
(Other) :885 NA's :177 (Other) :852
fare cabin embarked
Min. : 0.00 :687 : 2
1st Qu.: 7.91 B96 B98 : 4 C:168
Median : 14.45 C23 C25 C27: 4 Q: 77
Mean : 32.20 G6 : 4 S:644
3rd Qu.: 31.00 C22 C26 : 3
Max. :512.33 D : 3
(Other) :186
# what objects are contained - here it's a simple flat thing
names(leoIsDead)
[1] "survived" "pclass" "name" "sex" "age" "sibsp" "parch" "ticket" "fare" "cabin" "embarked"
# its dimensions - a simple count of rows and columns
dim(leoIsDead)
[1] 891 11
# more advanced - objects need not be simple - can have complex structure
str(leoIsDead)
'data.frame': 891 obs. of 11 variables:
$ survived: int 0 1 1 1 0 0 0 0 1 1 ...
$ pclass : int 3 1 3 1 3 3 1 3 3 2 ...
$ name : Factor w/ 891 levels "Abbing, Mr. Anthony",..: 109 191 358 277 16 559 520 629 417 581 ...
$ sex : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...
$ age : num 22 38 26 35 35 NA 54 2 27 14 ...
$ sibsp : int 1 1 0 1 0 0 0 3 0 1 ...
$ parch : int 0 0 0 0 0 0 0 1 2 0 ...
$ ticket : Factor w/ 681 levels "110152","110413",..: 524 597 670 50 473 276 86 396 345 133 ...
$ fare : num 7.25 71.28 7.92 53.1 8.05 ...
$ cabin : Factor w/ 148 levels "","A10","A14",..: 1 83 1 57 1 1 131 1 1 1 ...
$ embarked: Factor w/ 4 levels "","C","Q","S": 4 2 4 4 4 3 4 4 4 2 ...
If you’re in R, I’d encourage you to do data manipulation using dplyr and tidyr (both in tidyverse) and presentation plotting via ggplot2. R has the feature(?) that there are often many ways to do the same thing - but let’s be prescriptive to start (in contrast, for Python “There should be one - and preferably only one - obvious way to do it”, though “obvious” is subject to debate).
I’ll do a little here using the tidyverse tools. Note I’ll use pipes as a personal preference, i.e. %>% passes the resulting data frame from one step to the next.
#= some data manipulation
# filter/select: select only first class, then take 2 columns, rename
resultingData <- leoIsDead %>% filter(pclass == 1) %>%
select(sex, age) %>% rename(gender = sex)
head(resultingData)
# group data and summarise:
leoIsDead %>% group_by(sex, pclass) %>% summarise(meanAge = mean(age, na.rm=T))
# a plot - no particular purpose - just a plot
# pclass is treated as a simple number - I'll alter that for the plot
leoIsDead <- leoIsDead %>% mutate(pclass = factor(pclass))
p <- ggplot(data=leoIsDead) + geom_point(aes(age, fare, colour = pclass)) + ggtitle('Just a plot', subtitle = 'age/class/fare')
p
We’ll fit a model like the one seen in the lecture - I’ll leave you to build the more complex ones. One tip though - some things are coded as numbers although they are really classes. This can have consequences! pclass is an example, although it might be used profitably either way. survived, on the other hand, is alive/dead coded 1/0, so some caution is needed: a computer may happily treat this as a number when you would have preferred not.
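To see the trap concretely, here’s a small sketch: treated as a number, survived happily averages; recoded as a factor, it tabulates as categories - and modelling functions will behave differently in each case.

#= numbers vs factors: the same column behaves differently
mean(leoIsDead$survived)  # "works" because it's 0/1 coded - gives the survival proportion
table(factor(leoIsDead$survived, labels=c("died","survived"))) # now it's categorical
# lm(survived ~ age, data=leoIsDead) would happily fit a straight line to the 0/1 codes -
# probably not the classification you intended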
#= women and children first!
# a simple tree. Note "class" means the response is treated as a factor, not a number
wacfTrain <- rpart(survived ~ sex + age, method="class", data=leoIsDead)
#= plot the tree
plot(wacfTrain, uniform=TRUE,
main="Women and Children First", margin = 0.1)
text(wacfTrain, use.n=TRUE, all=TRUE, cex=.8)
#= the rpart plot is terrible, so get something better
fancyRpartPlot(wacfTrain)
So, you’ll recall from the lecture (you attended yes?) that we use recursive binary partioning i.e. keep on splitting the covariate space up. Each split increases the number of sub-spaces by 1. The tree shows the conditions for each split as you go down the tree (nodes) starting at the stump giving branches each time. Eventually we stop (reasons why to be covered in the next lectures) giving terminal nodes.
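You can also read the same structure as text - printing the fitted object lists each node with its split condition, counts, and predicted class, with terminal nodes marked by a *:

# the tree as text: node), split, n, loss, yval, (yprob); * denotes a terminal node
print(wacfTrain)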
We can look inside the model object.
# generic summary
summary(wacfTrain)
Call:
rpart(formula = survived ~ sex + age, data = leoIsDead, method = "class")
n= 891
CP nsplit rel error xerror xstd
1 0.44444444 0 1.0000000 1.0000000 0.04244576
2 0.02339181 1 0.5555556 0.5555556 0.03574957
3 0.01000000 2 0.5321637 0.5614035 0.03588593
Variable importance
sex age
92 8
Node number 1: 891 observations, complexity param=0.4444444
predicted class=0 expected loss=0.3838384 P(node) =1
class counts: 549 342
probabilities: 0.616 0.384
left son=2 (577 obs) right son=3 (314 obs)
Primary splits:
sex splits as RL, improve=124.426300, (0 missing)
age < 6.5 to the right, improve= 8.814172, (177 missing)
Node number 2: 577 observations, complexity param=0.02339181
predicted class=0 expected loss=0.1889081 P(node) =0.647587
class counts: 468 109
probabilities: 0.811 0.189
left son=4 (553 obs) right son=5 (24 obs)
Primary splits:
age < 6.5 to the right, improve=10.78893, (124 missing)
Node number 3: 314 observations
predicted class=1 expected loss=0.2579618 P(node) =0.352413
class counts: 81 233
probabilities: 0.258 0.742
Node number 4: 553 observations
predicted class=0 expected loss=0.1681736 P(node) =0.620651
class counts: 460 93
probabilities: 0.832 0.168
Node number 5: 24 observations
predicted class=1 expected loss=0.3333333 P(node) =0.02693603
class counts: 8 16
probabilities: 0.333 0.667
# what's inside?
names(wacfTrain)
[1] "frame" "where" "call" "terms" "cptable" "method" "parms"
[8] "control" "functions" "numresp" "splits" "csplit" "variable.importance" "y"
[15] "ordered"
str(wacfTrain)
List of 15
$ frame :'data.frame': 5 obs. of 9 variables:
..$ var : Factor w/ 3 levels "<leaf>","age",..: 3 2 1 1 1
..$ n : int [1:5] 891 577 553 24 314
..$ wt : num [1:5] 891 577 553 24 314
..$ dev : num [1:5] 342 109 93 8 81
..$ yval : num [1:5] 1 1 1 2 2
..$ complexity: num [1:5] 0.44444 0.02339 0 0 0.00877
..$ ncompete : int [1:5] 1 0 0 0 0
..$ nsurrogate: int [1:5] 0 0 0 0 0
..$ yval2 : num [1:5, 1:6] 1 1 1 2 2 549 468 460 8 81 ...
.. ..- attr(*, "dimnames")=List of 2
.. .. ..$ : NULL
.. .. ..$ : chr [1:6] "" "" "" "" ...
$ where : Named int [1:891] 3 5 5 5 3 3 3 4 5 5 ...
..- attr(*, "names")= chr [1:891] "1" "2" "3" "4" ...
$ call : language rpart(formula = survived ~ sex + age, data = leoIsDead, method = "class")
$ terms :Classes 'terms', 'formula' language survived ~ sex + age
.. ..- attr(*, "variables")= language list(survived, sex, age)
.. ..- attr(*, "factors")= int [1:3, 1:2] 0 1 0 0 0 1
.. .. ..- attr(*, "dimnames")=List of 2
.. .. .. ..$ : chr [1:3] "survived" "sex" "age"
.. .. .. ..$ : chr [1:2] "sex" "age"
.. ..- attr(*, "term.labels")= chr [1:2] "sex" "age"
.. ..- attr(*, "order")= int [1:2] 1 1
.. ..- attr(*, "intercept")= int 1
.. ..- attr(*, "response")= int 1
.. ..- attr(*, ".Environment")=<environment: R_GlobalEnv>
.. ..- attr(*, "predvars")= language list(survived, sex, age)
.. ..- attr(*, "dataClasses")= Named chr [1:3] "numeric" "factor" "numeric"
.. .. ..- attr(*, "names")= chr [1:3] "survived" "sex" "age"
$ cptable : num [1:3, 1:5] 0.4444 0.0234 0.01 0 1 ...
..- attr(*, "dimnames")=List of 2
.. ..$ : chr [1:3] "1" "2" "3"
.. ..$ : chr [1:5] "CP" "nsplit" "rel error" "xerror" ...
$ method : chr "class"
$ parms :List of 3
..$ prior: num [1:2(1d)] 0.616 0.384
.. ..- attr(*, "dimnames")=List of 1
.. .. ..$ : chr [1:2] "1" "2"
..$ loss : num [1:2, 1:2] 0 1 1 0
..$ split: num 1
$ control :List of 9
..$ minsplit : int 20
..$ minbucket : num 7
..$ cp : num 0.01
..$ maxcompete : int 4
..$ maxsurrogate : int 5
..$ usesurrogate : int 2
..$ surrogatestyle: int 0
..$ maxdepth : int 30
..$ xval : int 10
$ functions :List of 3
..$ summary:function (yval, dev, wt, ylevel, digits)
..$ print :function (yval, ylevel, digits)
..$ text :function (yval, dev, wt, ylevel, digits, n, use.n)
$ numresp : int 4
$ splits : num [1:3, 1:5] 891 714 453 2 1 ...
..- attr(*, "dimnames")=List of 2
.. ..$ : chr [1:3] "sex" "age" "age"
.. ..$ : chr [1:5] "count" "ncat" "improve" "index" ...
$ csplit : int [1, 1:2] 3 1
$ variable.importance: Named num [1:2] 124.4 10.8
..- attr(*, "names")= chr [1:2] "sex" "age"
$ y : int [1:891] 1 2 2 2 1 1 1 1 2 2 ...
$ ordered : Named logi [1:2] FALSE FALSE
..- attr(*, "names")= chr [1:2] "sex" "age"
- attr(*, "xlevels")=List of 1
..$ sex: chr [1:2] "female" "male"
- attr(*, "ylevels")= chr [1:2] "0" "1"
- attr(*, "class")= chr "rpart"
We’ll look at pruning trees in future, i.e. figuring out how complex they ought to be for low generalisation error. However, here the tree is simple.
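As a taste of what’s coming (a sketch only - the reasoning is for the next lectures), rpart already cross-validates over its complexity parameter cp while fitting, and prune() cuts the tree back at whatever cp you choose:

#= a preview of pruning
printcp(wacfTrain)  # the cp table (cross-validated error per tree size)
plotcp(wacfTrain)   # plot the cross-validated error against cp
# cp=0.05 below is an arbitrary illustration - in practice pick it from the cp table
prunedTree <- prune(wacfTrain, cp=0.05)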
Predicting is relatively straightforward; however, for a classification there are different sorts of prediction you might request. There are immediately two - the probability of being in each class, or the simple class prediction.
I’ll just go for the class in the first instance.
# model predictions for the training set
treePreds <- predict(wacfTrain, type='class')
summary(treePreds)
0 1
553 338
We can now use these to look at misclassification rates etc (for the training data).
# how are we doing? A confusion matrix for the training data
table(observed = leoIsDead$survived, predicted = treePreds)
predicted
observed 0 1
0 460 89
1 93 249
dim(leoIsDead)
[1] 891 11
So, our misclassification rate is (89+93)/891 ≈ 0.204, the proportion of observations in the off-diagonal positions (the false positives and false negatives).
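Rather than reading the cells off by eye, you can compute the rate directly (converting both sides to character avoids any factor-versus-integer comparison surprises):

# misclassification rate for the training data
mean(as.character(treePreds) != as.character(leoIsDead$survived))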
For a Kaggle competition, we need to make predictions for a test set whose answers we don’t know - we then upload these for Kaggle to calculate how we did (Kaggle knows the responses). So, we pass our test covariates into our model - I’ll produce predicted probabilities here.
# I think this is the format needed - we'll find out
kagglePreds <- predict(wacfTrain, testData, type='prob')
write.csv(kagglePreds,"myKagglePreds.csv", row.names=FALSE)
# note it has two columns - prob of each class.
head(kagglePreds)
0 1
1 0.8318264 0.1681736
2 0.2579618 0.7420382
3 0.8318264 0.1681736
4 0.8318264 0.1681736
5 0.2579618 0.7420382
6 0.8318264 0.1681736
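For what it’s worth, the standard Kaggle Titanic competition actually asks for hard 0/1 labels keyed by a passenger id (check the competition’s sample submission file). If that’s what’s wanted here, something like the following sketch would do - note that passengerid is my guess at the id column’s name in this copy of the data, so check names(testData) first:

#= alternative submission format: id + hard class label
# NB "passengerid" is an assumed column name - check names(testData)
classPreds <- predict(wacfTrain, testData, type="class")
submission <- data.frame(passengerid = testData$passengerid,
                         survived = classPreds)
write.csv(submission, "myKaggleClassPreds.csv", row.names=FALSE)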
This should have stepped you through some basics. Next time, we’ll be more concerned with getting models optimised for their generalisation error.