Classification of Ocean Microbes

Cloud database: http://homes.cs.washington.edu/~billhowe/bigdatacloud/

# install.packages("caret")
# install.packages("rpart")
# install.packages("tree")
# install.packages("randomForest")
# install.packages("e1071")
# install.packages("ggplot2")

Dataset

You are provided a dataset, seaflow_21min.csv, that represents a 21-minute sample from the vessel. This sample has been pre-processed to remove the calibration “beads” that are passed through the system for monitoring, as well as some other particle types.

The columns of this dataset are as follows:

file_id, time, cell_id, d1, d2, fsc_small, fsc_perp, fsc_big, pe, chl_small, chl_big, pop

Overview

The goal of this assignment is to use R to experiment with a real dataset – including its idiosyncrasies – and to consider the strengths and weaknesses of a few popular methods for supervised learning.

Step 1: Read and summarize the data

Using R, read the file seaflow_21min.csv and get the overall counts for each category of particle. You may consider using the functions read.csv and summary.

setwd("E:\\Dropbox\\Github\\introdata\\datasci_course_materials\\assignment5")
myData <- read.csv("seaflow_21min.csv")
summary(myData)
##     file_id           time          cell_id            d1       
##  Min.   :203.0   Min.   : 12.0   Min.   :    0   Min.   : 1328  
##  1st Qu.:204.0   1st Qu.:174.0   1st Qu.: 7486   1st Qu.: 7296  
##  Median :206.0   Median :362.0   Median :14995   Median :17728  
##  Mean   :206.2   Mean   :341.5   Mean   :15008   Mean   :17039  
##  3rd Qu.:208.0   3rd Qu.:503.0   3rd Qu.:22401   3rd Qu.:24512  
##  Max.   :209.0   Max.   :643.0   Max.   :32081   Max.   :54048  
##        d2          fsc_small        fsc_perp        fsc_big     
##  Min.   :   32   Min.   :10005   Min.   :    0   Min.   :32384  
##  1st Qu.: 9584   1st Qu.:31341   1st Qu.:13496   1st Qu.:32400  
##  Median :18512   Median :35483   Median :18069   Median :32400  
##  Mean   :17437   Mean   :34919   Mean   :17646   Mean   :32405  
##  3rd Qu.:24656   3rd Qu.:39184   3rd Qu.:22243   3rd Qu.:32416  
##  Max.   :54688   Max.   :65424   Max.   :63456   Max.   :32464  
##        pe          chl_small        chl_big           pop       
##  Min.   :    0   Min.   : 3485   Min.   :    0   crypto :  102  
##  1st Qu.: 1635   1st Qu.:22525   1st Qu.: 2800   nano   :12698  
##  Median : 2421   Median :30512   Median : 7744   pico   :20860  
##  Mean   : 5325   Mean   :30164   Mean   : 8328   synecho:18146  
##  3rd Qu.: 5854   3rd Qu.:38299   3rd Qu.:12880   ultra  :20537  
##  Max.   :58675   Max.   :64832   Max.   :57184

Question 1 How many particles labeled “synecho” are in the file provided?

18146

Question 2 What is the 3rd Quartile of the field fsc_small? (the summary function computes this on your behalf)

39184


Step 2: Split the data into test and training sets

Divide the data into two equal subsets, one for training and one for testing. Make sure to divide the data in an unbiased manner.

You might consider using either the createDataPartition function or the sample function, although there are many ways to achieve the goal.

Answer Question 3.

library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
set.seed(3456)
inTrainingSet <- createDataPartition(myData$pop,p = 0.5,list=FALSE)
training <- myData[inTrainingSet,]
testing <- myData[-inTrainingSet,]
summary(training)
mean(training$time)   # the mean of 'time' in the training set answers Question 3
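
For reference, the base-R alternative using sample() mentioned above looks like the following; a minimal sketch with illustrative names (idx, training_alt, testing_alt), not the split used for the answers below:

set.seed(3456)
# Draw half of the row indices at random, without replacement.
idx <- sample(nrow(myData), size = floor(nrow(myData) / 2))
training_alt <- myData[idx, ]
testing_alt  <- myData[-idx, ]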

Question 3 What is the mean of the variable “time” for your training set?

341.6

Step 3: Plot the data

Plot pe against chl_small and color by pop

I recommend using the function ggplot in the library ggplot2 rather than using base R functions, but this is not required.

Answer Question 4.

Question 4 In the plot of pe vs. chl_small, the particles labeled ultra should appear to be somewhat “mixed” with two other populations of particles. Which two populations?

library(ggplot2)
p <- ggplot(myData, aes(x = chl_small, y = pe))
p + geom_point(aes(colour = pop))   # individual particles as points; geom_line would join them meaninglessly

(plot: pe vs. chl_small, colored by pop)

ggplot(aes(x=chl_small,y=pe,color=pop),data=myData)+geom_jitter()

(plot: pe vs. chl_small, colored by pop, jittered)

pico and nano

Step 4: Train a decision tree.

Install the rpart library if you do not have it, and load the library.

Many statistical models in R provide an interface of the form

model <- train(formula, dataframe)

You can then use the model object to make predictions by passing it to the predict function.

A formula object describes the relationship between the independent variables and the response variable, and can be expressed with the syntax

response ~ var1 + var2 + var3

and used with the formula function to construct the formula object itself:

fol <- formula(response ~ var1 + var2 + var3)

The rpart library uses this convention. Assuming your training data is in a data frame called training and you have constructed a formula object called fol, you can construct a decision tree using the following syntax (included here to save you from struggling with a couple of unusual aspects of R):

model <- rpart(fol, method="class", data=training)

Train a tree as a function of the sensor measurements: fsc_small + fsc_perp + fsc_big + pe + chl_big + chl_small

Print the model object using the print function: print(model)

library(rpart)
fol <- formula(pop ~ fsc_small + fsc_perp + fsc_big + pe + chl_big + chl_small)
model_1 <- rpart(fol, method="class", data=training)
print(model_1)
# partykit enhances the plotting functions for recursive partitioning.
# install.packages("partykit")
library(partykit)
## Loading required package: grid
# Convert the rpart object to the class "party" and plot it to see more detail in the terminal nodes.
rpart1a <- as.party(model_1)
plot(rpart1a)

(plot: decision tree rendered with partykit)

library(rpart.plot)
rpart.plot(model_1, branch=0, branch.type=2, type=1, extra=102,
           shadow.col="pink", box.col="gray", split.col="magenta")

(plot: decision tree rendered with rpart.plot)


The output is a set of decision nodes, one node per line. Each line is indented to indicate the depth of the node in the tree.

To make a prediction, walk down the tree applying the predicates to determine which branch to take. For example, in this bogus tree, a particle with chl_small=25000 and fsc_perp=1000 would take branch 2, branch 4, then branch 10, and be classified as nano.
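
The same walk can be done programmatically. Here is a hedged sketch classifying one synthetic particle with the fitted tree; every value below is made up for illustration except chl_small and fsc_perp, which come from the example above:

# One made-up particle; all six predictors used in the formula must be present.
one_particle <- data.frame(fsc_small = 30000, fsc_perp = 1000,
                           fsc_big = 32400, pe = 2000,
                           chl_big = 8000, chl_small = 25000)
predict(model_1, newdata = one_particle, type = "class")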

Answer Questions 5, 6, 7.

Question 5 Use print(model) to inspect your tree. Which populations, if any, is your tree incapable of recognizing? (Which populations do not appear on any branch?) (It's possible, but very unlikely, that an incorrect answer to this question is the result of improbable sampling.)

crypto

Question 6 Most trees will include a node near the root that applies a rule to the pe field, where particles with a value less than some threshold will descend down one branch, and particles with a value greater than some threshold will descend down a different branch. If you look at the plot you created previously, you can verify that the threshold used in the tree is evident visually. What is the value of the threshold on the pe field learned in your model?

5001.5

Question 7 Based on your decision tree, which variables appear to be most important in predicting the class population?

pe and chl_small

Step 5: Evaluate the decision tree on the test data.

Use the predict function to generate predictions on your test data. Then, compare these predictions with the class labels in the test data itself.

In R, if you write A==B where A and B are vectors, the result is a vector of TRUE/FALSE values. sum() treats TRUE as 1 and FALSE as 0, so the sum of this vector is the number of correct predictions. Dividing this sum by the size of the test dataset gives you the accuracy (equivalently, mean(A==B)).
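
As a minimal sketch of this computation using the tree from Step 4:

pred_tree <- predict(model_1, newdata = testing, type = "class")
correct <- pred_tree == testing$pop   # logical vector: TRUE where the prediction matches
sum(correct) / length(correct)        # accuracy; equivalently, mean(correct)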

Answer Question 8.

# Without type="class", predict.rpart returns a matrix of class
# probabilities, one column per population.
predict_1 = predict(model_1, newdata=testing)

str(predict_1)
##  num [1:36171, 1:5] 0 0 0 0 0 0 0 0 0 0 ...
##  - attr(*, "dimnames")=List of 2
##   ..$ : chr [1:36171] "1" "2" "4" "7" ...
##   ..$ : chr [1:5] "crypto" "nano" "pico" "synecho" ...
pop_names = c("crypto", "nano", "pico", "synecho", "ultra")
# For each row, take the name of the class with the largest probability.
pop_test_1 = pop_names[apply(predict_1, 1, which.max)]
result_1=as.vector(testing$pop)==pop_test_1
table(result_1)
## result_1
## FALSE  TRUE 
##  5196 30975
accuracy_1 = sum(result_1) / length(pop_test_1)
accuracy_1   # 30975/36171, about 0.8563

# Simpler: ask predict() for the class labels directly.
predict_1c = predict(model_1, newdata=testing, type="class")
table(predict_1c, testing$pop)
##           
## predict_1c crypto nano pico synecho ultra
##    crypto       0    0    0       0     0
##    nano        51 4990    1      35   663
##    pico         0    2 9368       3  1919
##    synecho      0   20   47    9035   104
##    ultra        0 1337 1014       0  7582
confusionMatrix(predict_1c, testing$pop)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction crypto nano pico synecho ultra
##    crypto       0    0    0       0     0
##    nano        51 4990    1      35   663
##    pico         0    2 9368       3  1919
##    synecho      0   20   47    9035   104
##    ultra        0 1337 1014       0  7582
## 
## Overall Statistics
##                                           
##                Accuracy : 0.8563          
##                  95% CI : (0.8527, 0.8599)
##     No Information Rate : 0.2884          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.806           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: crypto Class: nano Class: pico Class: synecho
## Sensitivity                0.00000      0.7860      0.8982         0.9958
## Specificity                1.00000      0.9749      0.9253         0.9937
## Pos Pred Value                 NaN      0.8693      0.8296         0.9814
## Neg Pred Value             0.99859      0.9553      0.9573         0.9986
## Prevalence                 0.00141      0.1755      0.2884         0.2508
## Detection Rate             0.00000      0.1380      0.2590         0.2498
## Detection Prevalence       0.00000      0.1587      0.3122         0.2545
## Balanced Accuracy          0.50000      0.8804      0.9117         0.9948
##                      Class: ultra
## Sensitivity                0.7384
## Specificity                0.9092
## Pos Pred Value             0.7633
## Neg Pred Value             0.8976
## Prevalence                 0.2839
## Detection Rate             0.2096
## Detection Prevalence       0.2746
## Balanced Accuracy          0.8238

Question 8 What was the accuracy of your decision tree on the test data? Enter a number between 0 and 1.

0.8563

Step 6: Build and evaluate a random forest.

Load the randomForest library, then call randomForest using the formula object and the data, as you did to build a single tree:

# install.packages("randomForest")
library(randomForest)
## randomForest 4.6-10
## Type rfNews() to see new features/changes/bug fixes.
model_6 <- randomForest(fol, data=training)

Evaluate this model on the test data the same way you did for the tree.

Answer Question 9.

Question 9 What was the accuracy of your random forest model on the test data? Enter a number between 0 and 1.


# As before, ask predict() for the class labels directly.
predict_9c = predict(model_6, newdata=testing, type="class")
table(predict_9c, testing$pop)
##           
## predict_9c crypto  nano  pico synecho ultra
##    crypto      51     2     0       1     0
##    nano         0  5541     0       4   357
##    pico         0     0 10060       0  1365
##    synecho      0     2    10    9068    10
##    ultra        0   804   360       0  8536
confusionMatrix(predict_9c, testing$pop)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction crypto  nano  pico synecho ultra
##    crypto      51     2     0       1     0
##    nano         0  5541     0       4   357
##    pico         0     0 10060       0  1365
##    synecho      0     2    10    9068    10
##    ultra        0   804   360       0  8536
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9194          
##                  95% CI : (0.9166, 0.9222)
##     No Information Rate : 0.2884          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.8913          
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: crypto Class: nano Class: pico Class: synecho
## Sensitivity               1.000000      0.8727      0.9645         0.9994
## Specificity               0.999917      0.9879      0.9470         0.9992
## Pos Pred Value            0.944444      0.9388      0.8805         0.9976
## Neg Pred Value            1.000000      0.9733      0.9850         0.9998
## Prevalence                0.001410      0.1755      0.2884         0.2508
## Detection Rate            0.001410      0.1532      0.2781         0.2507
## Detection Prevalence      0.001493      0.1632      0.3159         0.2513
## Balanced Accuracy         0.999958      0.9303      0.9557         0.9993
##                      Class: ultra
## Sensitivity                0.8313
## Specificity                0.9551
## Pos Pred Value             0.8800
## Neg Pred Value             0.9346
## Prevalence                 0.2839
## Detection Rate             0.2360
## Detection Prevalence       0.2682
## Balanced Accuracy          0.8932

0.9194

Random forests can automatically generate an estimate of variable importance during training by permuting the values in a given variable and measuring the effect on classification. If scrambling the values has little effect on the model's ability to make predictions, then the variable must not be very important.
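
The model trained above does not store this permutation-based measure; it is only computed when importance=TRUE is set at training time. A minimal sketch (model_perm is an illustrative name):

# Refit so that permutation importance is computed during training.
model_perm <- randomForest(fol, data = training, importance = TRUE)
importance(model_perm, type = 1)   # type=1: mean decrease in accuracy (permutation-based)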

A random forest can obtain another estimate of variable importance based on the Gini impurity that we discussed in the lecture. The function importance(model) prints the mean decrease in Gini importance for each variable. The higher the number, the more the Gini impurity score decreases by branching on this variable, indicating that the variable is more important.

Call this function and answer Question 10.

importance(model_6)
##           MeanDecreaseGini
## fsc_small        2716.6570
## fsc_perp         2131.8252
## fsc_big           198.1133
## pe               8856.7270
## chl_big          4829.5931
## chl_small        8120.4793

pe (the highest MeanDecreaseGini), followed by chl_small.
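
For a quick visual check, the randomForest package also provides varImpPlot:

varImpPlot(model_6)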

Step 7: Train a support vector machine model and compare results.

Use the e1071 library and call model <- svm(fol, data=trainingdata).

Answer Question 11.

library(e1071)
model_svm <- svm(fol, data=training)

Question 11 What was the accuracy of your support vector machine model on the test data? Enter a number between 0 and 1

0.9192

predict_11c=predict(model_svm,newdata=testing,type= "class")
table(predict_11c, testing$pop)
##            
## predict_11c crypto  nano  pico synecho ultra
##     crypto      50     5     0       1     0
##     nano         0  5615     0       2   420
##     pico         0     0 10043      25  1345
##     synecho      1     2    62    9044     6
##     ultra        0   727   325       1  8497
confusionMatrix(predict_11c, testing$pop)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction crypto  nano  pico synecho ultra
##    crypto      50     5     0       1     0
##    nano         0  5615     0       2   420
##    pico         0     0 10043      25  1345
##    synecho      1     2    62    9044     6
##    ultra        0   727   325       1  8497
## 
## Overall Statistics
##                                          
##                Accuracy : 0.9192         
##                  95% CI : (0.9164, 0.922)
##     No Information Rate : 0.2884         
##     P-Value [Acc > NIR] : < 2.2e-16      
##                                          
##                   Kappa : 0.8911         
##  Mcnemar's Test P-Value : NA             
## 
## Statistics by Class:
## 
##                      Class: crypto Class: nano Class: pico Class: synecho
## Sensitivity               0.980392      0.8844      0.9629         0.9968
## Specificity               0.999834      0.9858      0.9468         0.9974
## Pos Pred Value            0.892857      0.9301      0.8800         0.9922
## Neg Pred Value            0.999972      0.9756      0.9844         0.9989
## Prevalence                0.001410      0.1755      0.2884         0.2508
## Detection Rate            0.001382      0.1552      0.2777         0.2500
## Detection Prevalence      0.001548      0.1669      0.3155         0.2520
## Balanced Accuracy         0.990113      0.9351      0.9548         0.9971
##                      Class: ultra
## Sensitivity                0.8275
## Specificity                0.9593
## Pos Pred Value             0.8897
## Neg Pred Value             0.9335
## Prevalence                 0.2839
## Detection Rate             0.2349
## Detection Prevalence       0.2640
## Balanced Accuracy          0.8934

Step 8: Construct confusion matrices

Use the table function to generate a confusion matrix for each of your three methods. Generate predictions using the predict function, then call the table function like this:

table(pred = predictions, true = testingdata$pop)

Answer Question 12.

Question 12 Construct a confusion matrix for each of the three methods using the table function. What appears to be the most common error the models make?

ultra is mistaken for pico (the largest off-diagonal counts in all three matrices are true ultra particles predicted as pico)
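
A hedged sketch that collects all three confusion matrices and accuracies in one place (preds is an illustrative name; assumes model_1, model_6, and model_svm from the previous steps are still in scope):

preds <- list(
  tree   = predict(model_1,   newdata = testing, type = "class"),
  forest = predict(model_6,   newdata = testing),
  svm    = predict(model_svm, newdata = testing)
)
lapply(preds, function(p) table(pred = p, true = testing$pop))   # three confusion matrices
sapply(preds, function(p) mean(p == testing$pop))                # three accuracies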

Step 9: Sanity check the data

As a data scientist, you should never trust the data, especially if you did not collect it yourself. There is no such thing as clean data. You should always be trying to prove your results wrong by finding problems with the data. Richard Feynman called this “bending over backwards to show how you're maybe wrong.” This is even more critical in data science because, almost by definition, you are using someone else's data that was collected for some purpose other than the experiment you want to run. So of course it's going to have problems.

The measurements in this dataset are all supposed to be continuous (fsc_small, fsc_perp, fsc_big, pe, chl_small, chl_big), but one is not. Using plots or R code, figure out which field is corrupted.
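
One quick non-graphical check is to count the distinct values each sensor column takes; a truly continuous measurement should take many values, while a corrupted one will take only a few. A minimal sketch:

sensors <- c("fsc_small", "fsc_perp", "fsc_big", "pe", "chl_small", "chl_big")
sapply(myData[sensors], function(x) length(unique(x)))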

Answer Question 13

Question 13 The variables in the dataset were assumed to be continuous, but one of them takes on only a few discrete values, suggesting a problem. Which variable exhibits this problem?

library(ggplot2)
p13 <- ggplot(myData, aes(x = time , y = fsc_big ) )
p13 + geom_jitter(aes(color=pop))

(plot: fsc_big vs. time, colored by pop)

ggplot(aes(x=time,y=pe,color=pop),data=myData)+geom_jitter()

(plot: pe vs. time, colored by pop)

fsc_big: it takes on only a few discrete values (see its summary above, where the minimum is 32384, the maximum is 32464, and the quartiles are nearly identical).

There is a more subtle issue with the data as well. Plot time vs. chl_big and you will notice that a band of the data looks out of place. This band corresponds to data from a particular file for which the sensor may have been miscalibrated. Remove this data from the dataset by filtering out all data associated with file_id 208, then repeat the experiment for all three methods, making sure to split the dataset into training and test sets after filtering out the bad data.
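
The worked example below redoes only the SVM; here is a hedged sketch for repeating all three methods on the filtered data (clean, training2, testing2, and fits are illustrative names):

# Filter first, then split, so train and test are both free of file_id 208.
clean <- subset(myData, file_id != 208)
set.seed(3456)
idx <- createDataPartition(clean$pop, p = 0.5, list = FALSE)
training2 <- clean[idx, ]
testing2  <- clean[-idx, ]
fits <- list(
  tree   = rpart(fol, method = "class", data = training2),
  forest = randomForest(fol, data = training2),
  svm    = svm(fol, data = training2)
)
# rpart needs type="class"; randomForest and svm return labels by default.
sapply(names(fits), function(nm) {
  p <- if (nm == "tree") predict(fits[[nm]], testing2, type = "class")
       else predict(fits[[nm]], testing2)
  mean(p == testing2$pop)
})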

Answer Question 14

library(ggplot2)
p14 <- ggplot(myData, aes(x = chl_big , y = time ) )
p14 + geom_jitter()

plot of chunk unnamed-chunk-13


Question 14 After removing data associated with file_id 208, what was the effect on the accuracy of your svm model? Enter a positive or negative number representing the net change in accuracy, where a positive number represents an improvement in accuracy and a negative number represents a decrease in accuracy.


library(e1071)

myDataFiltered <- subset(myData, file_id != 208)
set.seed(3456)
inTrainingSet <- createDataPartition(myDataFiltered$pop, p = 0.5, list = FALSE)
training <- myDataFiltered[inTrainingSet, ]
testing <- myDataFiltered[-inTrainingSet, ]
summary(training)
##     file_id           time          cell_id            d1       
##  Min.   :203.0   Min.   : 12.0   Min.   :    0   Min.   : 1488  
##  1st Qu.:204.0   1st Qu.:146.0   1st Qu.: 7451   1st Qu.: 7336  
##  Median :206.0   Median :303.0   Median :14772   Median :17792  
##  Mean   :205.8   Mean   :308.4   Mean   :14803   Mean   :17123  
##  3rd Qu.:207.0   3rd Qu.:435.0   3rd Qu.:22081   3rd Qu.:24624  
##  Max.   :209.0   Max.   :643.0   Max.   :31307   Max.   :50896  
##        d2          fsc_small        fsc_perp        fsc_big     
##  Min.   :   32   Min.   :10011   Min.   :    0   Min.   :32384  
##  1st Qu.: 9568   1st Qu.:31383   1st Qu.:13483   1st Qu.:32400  
##  Median :18512   Median :35515   Median :18096   Median :32400  
##  Mean   :17423   Mean   :34942   Mean   :17658   Mean   :32405  
##  3rd Qu.:24624   3rd Qu.:39277   3rd Qu.:22339   3rd Qu.:32416  
##  Max.   :54688   Max.   :65424   Max.   :62173   Max.   :32464  
##        pe          chl_small        chl_big           pop      
##  Min.   :    0   Min.   : 3485   Min.   :    0   crypto :  44  
##  1st Qu.: 1627   1st Qu.:23005   1st Qu.: 2992   nano   :4784  
##  Median : 2379   Median :30768   Median : 7872   pico   :9964  
##  Mean   : 5201   Mean   :30406   Mean   : 8426   synecho:7124  
##  3rd Qu.: 5216   3rd Qu.:38474   3rd Qu.:12944   ultra  :8259  
##  Max.   :58675   Max.   :64805   Max.   :57184
model_svm <- svm(fol, data=training)

Recompute the accuracy of the support vector machine model on the filtered test data:

# predict() on an svm model returns class labels by default; no type argument is needed.
predict_11c = predict(model_svm, newdata=testing)
table(predict_11c, testing$pop)
##            
## predict_11c crypto nano pico synecho ultra
##     crypto      40    2    0       0     0
##     nano         0 4570    0       1   205
##     pico         0    0 9723      23   187
##     synecho      3    3   51    7097     2
##     ultra        0  209  190       3  7864
confusionMatrix(predict_11c, testing$pop)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction crypto nano pico synecho ultra
##    crypto      40    2    0       0     0
##    nano         0 4570    0       1   205
##    pico         0    0 9723      23   187
##    synecho      3    3   51    7097     2
##    ultra        0  209  190       3  7864
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9709          
##                  95% CI : (0.9689, 0.9727)
##     No Information Rate : 0.3302          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9604          
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: crypto Class: nano Class: pico Class: synecho
## Sensitivity               0.930233      0.9553      0.9758         0.9962
## Specificity               0.999934      0.9919      0.9896         0.9974
## Pos Pred Value            0.952381      0.9569      0.9789         0.9918
## Neg Pred Value            0.999900      0.9916      0.9881         0.9988
## Prevalence                0.001425      0.1586      0.3302         0.2361
## Detection Rate            0.001326      0.1515      0.3222         0.2352
## Detection Prevalence      0.001392      0.1583      0.3292         0.2372
## Balanced Accuracy         0.965083      0.9736      0.9827         0.9968
##                      Class: ultra
## Sensitivity                0.9523
## Specificity                0.9817
## Pos Pred Value             0.9514
## Neg Pred Value             0.9820
## Prevalence                 0.2737
## Detection Rate             0.2606
## Detection Prevalence       0.2740
## Balanced Accuracy          0.9670
0.9709 - 0.9192
## [1] 0.0517

0.0517