last compiled: 2014-08-27 01:50:39
seaflow <- read.csv("seaflow_21min.csv", header=TRUE)
table(seaflow$pop)
##
## crypto nano pico synecho ultra
## 102 12698 20860 18146 20537
summary(seaflow$fsc_small)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 10000 31300 35500 34900 39200 65400
A1: 18146 (the synecho count in the table above)
A2: 39180 or 39200 (the 3rd quartile of fsc_small)
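The exact third quartile (summary() rounds its display) can be read directly with quantile(); a quick sketch:
#exact 75th percentile of fsc_small (summary() above shows a rounded 39200)
quantile(seaflow$fsc_small, 0.75)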
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
set.seed(1234)
trainIndex <- createDataPartition(seaflow$cell_id, p = 0.5, list=FALSE, times=1)
sfTrain <- seaflow[trainIndex, ]
sfTest <- seaflow[-trainIndex, ]
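As an aside, createDataPartition() is usually given the outcome variable so that the split is stratified by class; a sketch of that variant (not the split used for the answers below):
set.seed(1234)
#stratify the 50/50 split on the class label instead of on cell_id
trainIndexStrat <- createDataPartition(seaflow$pop, p = 0.5, list = FALSE, times = 1)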
What is the mean of the variable “time” for your training set?
summary(sfTrain$time)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 12 176 362 342 504 643
mean(sfTrain$time)
## [1] 342.2
A3: 341.5
Plot pe against chl_small and color by pop
In the plot of pe vs. chl_small, the particles labeled ultra should appear to be somewhat “mixed” with two other populations of particles. Which two populations?
require(ggplot2)
qplot(pe, chl_small, data=seaflow, color=pop, alpha=1/8)
qplot(pe, chl_small, data=seaflow, facets=pop~., color=pop, alpha=1/8)
A4: nano and pico
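Note that alpha=1/8 inside qplot() is mapped as an aesthetic rather than set as a constant, which is why an alpha legend appears; wrapping it in I() or calling ggplot() directly sets it as a constant. A sketch of the ggplot() form:
ggplot(seaflow, aes(x = pe, y = chl_small, color = pop)) +
  geom_point(alpha = 1/8)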
Train an rpart decision tree
library(rpart)
set.seed(1234)
#fit a binomial glm; pop has five levels, so this is not a multinomial logistic
#regression -- the object is only used to pass the model specification to rpart
sf_glm <- glm(pop ~ fsc_small + fsc_perp + fsc_big + pe + chl_big + chl_small,
              data = sfTrain,
              family = binomial("logit"))
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
#rpart reuses the model specification from sf_glm and fits a recursive
#partitioning classification tree (method="class")
pop_glm <- rpart(sf_glm, method="class", data=sfTrain)
print(pop_glm) #prints decision tree
## n= 36172
##
## node), split, n, loss, yval, (yprob)
## * denotes terminal node
##
## 1) root 36172 25840 ultra (0.0015 0.18 0.29 0.25 0.29)
## 2) pe< 5006 26360 16080 pico (0 0.22 0.39 0.00011 0.39)
## 4) chl_small< 3.225e+04 11388 1991 pico (0 0.00018 0.83 0.00026 0.17) *
## 5) chl_small>=3.225e+04 14972 6762 ultra (0 0.39 0.059 0 0.55)
## 10) chl_small>=4.137e+04 5107 584 nano (0 0.89 0.0002 0 0.11) *
## 11) chl_small< 4.137e+04 9865 2238 ultra (0 0.14 0.089 0 0.77) *
## 3) pe>=5006 9812 772 synecho (0.0057 0.055 0.0038 0.92 0.014)
## 6) chl_small>=3.771e+04 674 143 nano (0.083 0.79 0 0.061 0.068) *
## 7) chl_small< 3.771e+04 9138 139 synecho (0 0.0012 0.004 0.98 0.01) *
# plots decision tree
plot(pop_glm, main="Classification Tree of Ocean Microbes using Seaflow")
text(pop_glm, use.n=TRUE, all=TRUE, cex=0.8)
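Incidentally, the glm step above is not needed to grow the tree: rpart() accepts the formula directly and should fit essentially the same tree, since the glm object only contributes the model specification. A minimal sketch (pop_tree is just an illustrative name):
#fit the classification tree directly from a formula, without the glm detour
pop_tree <- rpart(pop ~ fsc_small + fsc_perp + fsc_big + pe + chl_big + chl_small,
                  data = sfTrain, method = "class")
print(pop_tree)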
#another way to verify the Q6 answer: find the minimum pe value among synecho particles
summary(seaflow[(seaflow$pop=="synecho"), "pe"])
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 5000 8960 13200 14300 18400 44600
Q5
Use print(model) to inspect your tree. Which populations, if any, is your tree incapable of recognizing? (Which populations do not appear on any branch?) (It's possible, but very unlikely, that an incorrect answer to this question is the result of improbable sampling.) Hint: Look at the printed tree above.
A5 crypto does not appear on any branch
Q6
Most trees will include a node near the root that applies a rule to the pe field, where particles with a value less than some threshold will descend down one branch, and particles with a value greater than some threshold will descend down a different branch. If you look at the plot you created previously, you can verify that the threshold used in the tree is evident visually. What is the value of the threshold on the pe field learned in your model?
A6 5004
Q7
Based on your decision tree, which variables appear to be most important in predicting the class population?
A7 All decisions in my tree were made on either chl_small or pe
Evaluate the accuracy of the rpart decision tree on the test data.
#use the fitted rpart tree to predict pop in sfTest
pred_glm <- predict(pop_glm, newdata=sfTest, type="class") # type="class" outputs the highest probability class
#the following code gives raw probabilities for each class
#pred_glm_raw <- predict(pop_glm, newdata=sfTest)
pred_glm <- as.data.frame(pred_glm) #convert the prediction factor to a data frame for cbind
sfTest <- cbind(sfTest, pred_glm)
#calculate the accuracy by dividing correct predictions by the number of observations in the test dataset
accuracy_glm <- sum(pred_glm$pred_glm == sfTest$pop)/NROW(sfTest)
accuracy_glm
## [1] 0.8531
Q8
How accurate was your decision tree on the test data? Enter a number between 0 and 1.
A8 0.8538332
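Equivalently, the test accuracy is just the mean of the logical comparison between predicted and true classes; a one-line sketch:
#proportion of test observations where the tree's predicted class matches pop
mean(predict(pop_glm, newdata = sfTest, type = "class") == sfTest$pop)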
Build and evaluate a random forest
library(randomForest)
cv.ctrl <- trainControl(method = "repeatedcv", repeats = 3)
pop_rf <- train(pop ~ fsc_small + fsc_perp + fsc_big + pe + chl_big + chl_small,
                data = sfTrain,
                method = "rf",
                trControl = cv.ctrl)
pred_rf <- predict(pop_rf, newdata=sfTest, type="raw")
## Loading required package: randomForest
## randomForest 4.6-10
## Type rfNews() to see new features/changes/bug fixes.
pred_rf <- as.data.frame(pred_rf)
sfTest <- cbind(sfTest, pred_rf)
accuracy_rf <- sum(pred_rf$pred_rf == sfTest$pop)/NROW(sfTest)
accuracy_rf
## [1] 0.9215
Q9
What was the accuracy of your random forest model on the test data? Enter a number between 0 and 1.
A9 0.9197423
A random forest can provide another estimate of variable importance based on the Gini impurity that we discussed in the lecture. The function importance(model) prints the mean decrease in Gini for each variable. The higher the number, the more the Gini impurity decreases by branching on that variable, indicating that the variable is more important.
#importance(pop_rf) ## doesn't work directly because pop_rf is a caret train object, not a randomForest object
Q10
After calling importance(model), you should be able to determine which variables appear to be most important in terms of the Gini impurity measure. Which ones are they?
A10 I couldn't get importance() to work on the caret object directly, but I'd guess that pe and chl_small are still the most important variables
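Two things that should work here (a sketch, not verified against this run): caret's varImp() accepts the train object directly, and the fitted randomForest itself is stored in pop_rf$finalModel, so importance() can be called on that.
#variable importance from the caret wrapper
varImp(pop_rf)
#mean decrease in Gini from the underlying randomForest fit
importance(pop_rf$finalModel)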
Train a support vector machine model and compare results.
require(e1071)
## Loading required package: e1071
#train support vector machine
pop_svm <- svm(pop ~ fsc_small + fsc_perp + fsc_big + pe + chl_big + chl_small,
               data=sfTrain)
#generate svm predictions
pred_svm <- predict(pop_svm, newdata=sfTest, type="raw")
pred_svm <- as.data.frame(pred_svm)
sfTest <- cbind(sfTest, pred_svm)
#calculate accuracy
accuracy_svm <- sum(pred_svm$pred_svm == sfTest$pop)/NROW(sfTest)
accuracy_svm
## [1] 0.9208
Q11
What was the accuracy of your support vector machine model on the test data? Enter a number between 0 and 1.
A11 0.9199635
Use the table function to generate a confusion matrix for each of your three methods. Generate predictions using the predict function, then call the table function like this: table(pred = predictions, true = testingdata$pop)
#regenerate the predictions as plain factors (the earlier ones were converted to data frames above)
pred_glm <- predict(pop_glm, newdata=sfTest, type="class")
pred_rf <- predict(pop_rf, newdata=sfTest, type="raw")
pred_svm <- predict(pop_svm, newdata=sfTest, type="raw")
table(pred=pred_glm, true=sfTest$pop)
## true
## pred crypto nano pico synecho ultra
## crypto 0 0 0 0 0
## nano 46 4923 0 49 720
## pico 0 2 9523 2 2021
## synecho 0 10 55 9052 103
## ultra 0 1337 968 0 7360
table(pred=pred_rf, true=sfTest$pop)
## true
## pred crypto nano pico synecho ultra
## crypto 46 1 0 1 0
## nano 0 5524 0 2 378
## pico 0 0 10220 0 1375
## synecho 0 1 14 9100 10
## ultra 0 746 312 0 8441
table(pred=pred_svm, true=sfTest$pop)
## true
## pred crypto nano pico synecho ultra
## crypto 43 4 0 1 0
## nano 1 5584 0 3 416
## pico 0 0 10150 26 1322
## synecho 2 2 67 9070 7
## ultra 0 682 329 3 8459
Q12
Construct a confusion matrix for each of the three methods using the table function. What appears to be the most common error the models make?
A12 In all three cases, the most common error appears to be that ultra is mistaken for pico
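caret's confusionMatrix() wraps the same table and adds overall and per-class statistics, which makes the ultra/pico confusion easier to quantify; a sketch for the random forest predictions:
#confusion matrix plus accuracy and per-class sensitivity/specificity
confusionMatrix(data = pred_rf, reference = sfTest$pop)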
As a data scientist, you should never trust the data, especially if you did not collect it yourself. There is no such thing as clean data. You should always be trying to prove your results wrong by finding problems with the data. Richard Feynman calls it “bending over backwards to show how you’re maybe wrong.” This is even more critical in data science, because almost by definition you are using someone else’s data that was collected for some other purpose rather than the experiment you want to do. So of course it’s going to have problems. The measurements in this dataset are all supposed to be continuous (fsc_small, fsc_perp, fsc_big, pe, chl_small, chl_big), but one is not. Using plots or R code, figure out which field is corrupted.
hist(seaflow$fsc_big, col="darkred")
Q13
The variables in the dataset were assumed to be continuous, but one of them takes on only a few discrete values, suggesting a problem. Which variable exhibits this problem?
A13 It appears fsc_big is the corrupted variable.
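To confirm numerically rather than visually, counting the distinct values in each measurement column makes the problem obvious; a sketch (fsc_big should show only a handful of values):
#number of distinct values per measurement column
sapply(seaflow[, c("fsc_small", "fsc_perp", "fsc_big",
                   "pe", "chl_small", "chl_big")],
       function(x) length(unique(x)))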
Q14
After removing data associated with file_id 208, what was the effect on the accuracy of your svm model? Enter a positive or negative number representing the net change in accuracy, where a positive number represents an improvement in accuracy and a negative number represents a decrease in accuracy.
#plot time vs. chl_big
qplot(time, chl_big, data=seaflow, color=as.factor(file_id))
#remove file_id 208
require(caret)
seaflow2 <- seaflow[(seaflow$file_id != "208"),]
set.seed(1234)
trainIndex2 <- createDataPartition(seaflow2$cell_id, p = 0.5, list=FALSE, times=1)
sfTrain2 <- seaflow2[trainIndex2, ]
sfTest2 <- seaflow2[-trainIndex2, ]
require(e1071)
#train support vector machine
pop_svm2 <- svm(pop ~ fsc_small + fsc_perp + fsc_big + pe + chl_big + chl_small,
                data=sfTrain2)
#generate svm predictions
pred_svm2 <- predict(pop_svm2, newdata=sfTest2, type="raw")
###pred_svm2 <- as.data.frame(pred_svm2)
sfTest2 <- cbind(sfTest2, pred_svm2)
#calculate accuracy
accuracy_svm2 <- sum(pred_svm2 == sfTest2$pop)/NROW(sfTest2)
accuracy_svm2 - accuracy_svm
## [1] 0.04994
A14 Prediction accuracy for svm increased by 0.05120277