last compiled: 2014-08-27 01:50:39
seaflow <- read.csv("seaflow_21min.csv", header=TRUE)
table(seaflow$pop)
##
## crypto nano pico synecho ultra
## 102 12698 20860 18146 20537
summary(seaflow$fsc_small)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 10000 31300 35500 34900 39200 65400
A1: 18146 (the synecho count in the table above)
A2: 39180 or 39200 (the 3rd quartile of fsc_small)
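The exact third quartile (summary() rounds its display) can be read directly with quantile(); a quick sketch:
#exact 75th percentile of fsc_small (summary() above shows a rounded 39200)
quantile(seaflow$fsc_small, 0.75)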
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
set.seed(1234)
trainIndex <- createDataPartition(seaflow$cell_id, p = 0.5, list=FALSE, times=1)
sfTrain <- seaflow[trainIndex, ]
sfTest <- seaflow[-trainIndex, ]
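As an aside, createDataPartition() is usually given the outcome variable so that the split is stratified by class; a sketch of that variant (not the split used for the answers below):
set.seed(1234)
#stratify the 50/50 split on the class label instead of on cell_id
trainIndexStrat <- createDataPartition(seaflow$pop, p = 0.5, list = FALSE, times = 1)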
What is the mean of the variable “time” for your training set?
summary(sfTrain$time)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 12 176 362 342 504 643
mean(sfTrain$time)
## [1] 342.2
A3: 341.5
Plot pe against chl_small and color by pop
In the plot of pe vs. chl_small, the particles labeled ultra should appear to be somewhat “mixed” with two other populations of particles. Which two populations?
require(ggplot2)
qplot(pe, chl_small, data=seaflow, color=pop, alpha=1/8)
qplot(pe, chl_small, data=seaflow, facets=pop~., color=pop, alpha=1/8)
A4: nano and pico
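Note that alpha=1/8 inside qplot() is mapped as an aesthetic rather than set as a constant, which is why an alpha legend appears; wrapping it in I() or calling ggplot() directly sets it as a constant. A sketch of the ggplot() form:
ggplot(seaflow, aes(x = pe, y = chl_small, color = pop)) +
  geom_point(alpha = 1/8)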
Train an rpart decision tree
library(rpart)
set.seed(1234)
#fit a binomial glm; pop has five levels, so this is not a multinomial logistic
#regression -- the object is only used to pass the model specification to rpart
sf_glm <- glm(pop ~ fsc_small + fsc_perp + fsc_big + pe + chl_big + chl_small,
              data = sfTrain,
              family = binomial("logit"))
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
#rpart reuses the model specification from sf_glm and fits a recursive
#partitioning classification tree (method="class")
pop_glm <- rpart(sf_glm, method="class", data=sfTrain)
print(pop_glm) #prints decision tree
## n= 36172
##
## node), split, n, loss, yval, (yprob)
## * denotes terminal node
##
## 1) root 36172 25840 ultra (0.0015 0.18 0.29 0.25 0.29)
## 2) pe< 5006 26360 16080 pico (0 0.22 0.39 0.00011 0.39)
## 4) chl_small< 3.225e+04 11388 1991 pico (0 0.00018 0.83 0.00026 0.17) *
## 5) chl_small>=3.225e+04 14972 6762 ultra (0 0.39 0.059 0 0.55)
## 10) chl_small>=4.137e+04 5107 584 nano (0 0.89 0.0002 0 0.11) *
## 11) chl_small< 4.137e+04 9865 2238 ultra (0 0.14 0.089 0 0.77) *
## 3) pe>=5006 9812 772 synecho (0.0057 0.055 0.0038 0.92 0.014)
## 6) chl_small>=3.771e+04 674 143 nano (0.083 0.79 0 0.061 0.068) *
## 7) chl_small< 3.771e+04 9138 139 synecho (0 0.0012 0.004 0.98 0.01) *
# plots decision tree
plot(pop_glm, main="Classification Tree of Ocean Microbes using Seaflow")
text(pop_glm, use.n=TRUE, all=TRUE, cex=0.8)
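Incidentally, the glm step above is not needed to grow the tree: rpart() accepts the formula directly and should fit essentially the same tree, since the glm object only contributes the model specification. A minimal sketch (pop_tree is just an illustrative name):
#fit the classification tree directly from a formula, without the glm detour
pop_tree <- rpart(pop ~ fsc_small + fsc_perp + fsc_big + pe + chl_big + chl_small,
                  data = sfTrain, method = "class")
print(pop_tree)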
#another way to verify the Q6 answer: find the minimum pe value among synecho particles
summary(seaflow[(seaflow$pop=="synecho"), "pe"])
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 5000 8960 13200 14300 18400 44600
Q5
Use print(model) to inspect your tree. Which populations, if any, is your tree incapable of recognizing? (Which populations do not appear on any branch?) (It's possible, but very unlikely, that an incorrect answer to this question is the result of improbable sampling.) Hint: Look at the printed tree above.
A5 crypto does not appear on any branch
Q6
Most trees will include a node near the root that applies a rule to the pe field, where particles with a value less than some threshold will descend down one branch, and particles with a value greater than some threshold will descend down a different branch. If you look at the plot you created previously, you can verify that the threshold used in the tree is evident visually. What is the value of the threshold on the pe field learned in your model?
A6 5004
Q7
Based on your decision tree, which variables appear to be most important in predicting the class population?
A7 All decisions in my tree were made on either chl_small or pe
Evaluate the accuracy of the rpart decision tree on the test data.
#use the fitted rpart tree to predict pop in sfTest
pred_glm <- predict(pop_glm, newdata=sfTest, type="class") # type="class" outputs the highest probability class
#the following code gives raw probabilities for each class
#pred_glm_raw <- predict(pop_glm, newdata=sfTest)
pred_glm <- as.data.frame(pred_glm) #convert the prediction factor to a data frame for cbind
sfTest <- cbind(sfTest, pred_glm)
#calculate the accuracy by dividing correct predictions by the number of observations in the test dataset
accuracy_glm <- sum(pred_glm$pred_glm == sfTest$pop)/NROW(sfTest)
accuracy_glm
## [1] 0.8531
Q8
How accurate was your decision tree on the test data? Enter a number between 0 and 1.
A8 0.8538332
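Equivalently, the test accuracy is just the mean of the logical comparison between predicted and true classes; a one-line sketch:
#proportion of test observations where the tree's predicted class matches pop
mean(predict(pop_glm, newdata = sfTest, type = "class") == sfTest$pop)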
Build and evaluate a random forest
library(randomForest)
cv.ctrl <- trainControl(method = "repeatedcv", repeats = 3)
pop_rf <- train(pop ~ fsc_small + fsc_perp + fsc_big + pe + chl_big + chl_small,
                data = sfTrain,
                method = "rf",
                trControl = cv.ctrl)
pred_rf <- predict(pop_rf, newdata=sfTest, type="raw")
## Loading required package: randomForest
## randomForest 4.6-10
## Type rfNews() to see new features/changes/bug fixes.
pred_rf <- as.data.frame(pred_rf)
sfTest <- cbind(sfTest, pred_rf)
accuracy_rf <- sum(pred_rf$pred_rf == sfTest$pop)/NROW(sfTest)
accuracy_rf
## [1] 0.9215
Q9
What was the accuracy of your random forest model on the test data? Enter a number between 0 and 1.
A9 0.9197423
A random forest can provide another estimate of variable importance based on the Gini impurity that we discussed in the lecture. The function importance(model) prints the mean decrease in Gini for each variable. The higher the number, the more the Gini impurity decreases by branching on that variable, indicating that the variable is more important.
#importance(pop_rf) ## doesn't work directly because pop_rf is a caret train object, not a randomForest object
Q10
After calling importance(model), you should be able to determine which variables appear to be most important in terms of the Gini impurity measure. Which ones are they?
A10 I couldn't get importance() to work on the caret object directly, but I'd guess that pe and chl_small are still the most important variables
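Two things that should work here (a sketch, not verified against this run): caret's varImp() accepts the train object directly, and the fitted randomForest itself is stored in pop_rf$finalModel, so importance() can be called on that.
#variable importance from the caret wrapper
varImp(pop_rf)
#mean decrease in Gini from the underlying randomForest fit
importance(pop_rf$finalModel)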
Train a support vector machine model and compare results.
require(e1071)
## Loading required package: e1071
#train support vector machine
pop_svm <- svm(pop ~ fsc_small + fsc_perp + fsc_big + pe + chl_big + chl_small,
               data=sfTrain)
#generate svm predictions
pred_svm <- predict(pop_svm, newdata=sfTest, type="raw")
pred_svm <- as.data.frame(pred_svm)
sfTest <- cbind(sfTest, pred_svm)
#calculate accuracy
accuracy_svm <- sum(pred_svm$pred_svm == sfTest$pop)/NROW(sfTest)
accuracy_svm
## [1] 0.9208
Q11
What was the accuracy of your support vector machine model on the test data? Enter a number between 0 and 1.
A11 0.9199635
Use the table function to generate a confusion matrix for each of your three methods. Generate predictions using the predict function, then call the table function like this: table(pred = predictions, true = testingdata$pop)
#regenerate the predictions as plain factors (the earlier ones were converted to data frames above)
pred_glm <- predict(pop_glm, newdata=sfTest, type="class")
pred_rf <- predict(pop_rf, newdata=sfTest, type="raw")
pred_svm <- predict(pop_svm, newdata=sfTest, type="raw")
table(pred=pred_glm, true=sfTest$pop)
## true
## pred crypto nano pico synecho ultra
## crypto 0 0 0 0 0
## nano 46 4923 0 49 720
## pico 0 2 9523 2 2021
## synecho 0 10 55 9052 103
## ultra 0 1337 968 0 7360
table(pred=pred_rf, true=sfTest$pop)
## true
## pred crypto nano pico synecho ultra
## crypto 46 1 0 1 0
## nano 0 5524 0 2 378
## pico 0 0 10220 0 1375
## synecho 0 1 14 9100 10
## ultra 0 746 312 0 8441
table(pred=pred_svm, true=sfTest$pop)
## true
## pred crypto nano pico synecho ultra
## crypto 43 4 0 1 0
## nano 1 5584 0 3 416
## pico 0 0 10150 26 1322
## synecho 2 2 67 9070 7
## ultra 0 682 329 3 8459
Q12
Construct a confusion matrix for each of the three methods using the table function. What appears to be the most common error the models make?
A12 In all three cases, the most common error appears to be that ultra is mistaken for pico
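caret's confusionMatrix() wraps the same table and adds overall and per-class statistics, which makes the ultra/pico confusion easier to quantify; a sketch for the random forest predictions:
#confusion matrix plus accuracy and per-class sensitivity/specificity
confusionMatrix(data = pred_rf, reference = sfTest$pop)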
As a data scientist, you should never trust the data, especially if you did not collect it yourself. There is no such thing as clean data. You should always be trying to prove your results wrong by finding problems with the data. Richard Feynman calls it “bending over backwards to show how you’re maybe wrong.” This is even more critical in data science, because almost by definition you are using someone else’s data that was collected for some other purpose rather than the experiment you want to do. So of course it’s going to have problems. The measurements in this dataset are all supposed to be continuous (fsc_small, fsc_perp, fsc_big, pe, chl_small, chl_big), but one is not. Using plots or R code, figure out which field is corrupted.
hist(seaflow$fsc_big, col="darkred")
Q13
The variables in the dataset were assumed to be continuous, but one of them takes on only a few discrete values, suggesting a problem. Which variable exhibits this problem?
A13 It appears fsc_big is the corrupted variable.
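To confirm numerically rather than visually, counting the distinct values in each measurement column makes the problem obvious; a sketch (fsc_big should show only a handful of values):
#number of distinct values per measurement column
sapply(seaflow[, c("fsc_small", "fsc_perp", "fsc_big",
                   "pe", "chl_small", "chl_big")],
       function(x) length(unique(x)))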
Q14
After removing data associated with file_id 208, what was the effect on the accuracy of your svm model? Enter a positive or negative number representing the net change in accuracy, where a positive number represents an improvement in accuracy and a negative number represents a decrease in accuracy.
#plot time vs. chl_big
qplot(time, chl_big, data=seaflow, color=as.factor(file_id))
#remove file_id 208
require(caret)
seaflow2 <- seaflow[(seaflow$file_id != "208"),]
set.seed(1234)
trainIndex2 <- createDataPartition(seaflow2$cell_id, p = 0.5, list=FALSE, times=1)
sfTrain2 <- seaflow2[trainIndex2, ]
sfTest2 <- seaflow2[-trainIndex2, ]
require(e1071)
#train support vector machine
pop_svm2 <- svm(pop ~ fsc_small + fsc_perp + fsc_big + pe + chl_big + chl_small,
                data=sfTrain2)
#generate svm predictions
pred_svm2 <- predict(pop_svm2, newdata=sfTest2, type="raw")
###pred_svm2 <- as.data.frame(pred_svm2)
sfTest2 <- cbind(sfTest2, pred_svm2)
#calculate accuracy
accuracy_svm2 <- sum(pred_svm2 == sfTest2$pop)/NROW(sfTest2)
accuracy_svm2 - accuracy_svm
## [1] 0.04994
A14 Prediction accuracy for svm increased by 0.05120277