Title: Classification of Ocean Microbes

This is a project of Introduction to Data Science (coursera), working with data from the SeaFlow environmental flow cytometry instrument.

A flow cytometer delivers a flow of particles through capilliary. By shining lasers of different wavelengths and measuring the absorption and refraction patterns, you can determine how large the particle is and some information about its color and other properties, allowing you to detect it.

The technology was developed for medical applciations, where the particles were potential pathogens in, say, serum, and the goal was to give a diagnosis. But the technology was adapted for use in environmental science to understand microbial population profiles.

The SeaFlow instrument, developed by the Armbrust Lab at the University of Washington, is unique in that it is deployed on research vessels and takes continuous measurements of population profiles in the open ocean.

Dataset

Here the dataset represents a 21 minute sample from the vessel in a file seaflow_21min.csv. This sample has been pre-processed to remove the calibration “beads” that are passed through the system for monitoring, as well as some other particle types.

The columns of this dataset are as follows:

file_id, time, cell_id, d1, d2, fsc_small, fsc_perp, fsc_big, pe, chl_small, chl_big, pop
file_id: The data arrives in files, where each file represents a three-minute window; this field represents which file the data came from. The number is ordered by time, but is otherwise not significant. time: This is an integer representing the time the particle passed through the instrument. Many particles may arrive at the same time; time is not a key for this relation. cell_id: A unique identifier for each cell WITHIN a file. (file_id, cell_id) is a key for this relation. d1, d2: Intensity of light at the two main sensors, oriented perpendicularly. These sensors are primarily used to determine whether the particles are properly centered in the stream. Used primarily in preprocesssing; they are unlikely to be useful for classification. fsc_small, fsc_perp, fsc_big: Forward scatter small, perpendicular, and big. These values help distingish different sizes of particles. pe: A measurement of phycoerythrin fluorescence, which is related to the wavelength associated with an orange color in microorganisms chl_small, chl_big: Measurements related to the wavelength of light corresponding to chlorophyll. pop: This is the class label assigned by the clustering mechanism used in the production system. It can be considered “ground truth” for the purposes of the assignment, but note that there are particles that cannot be unambiguously classified, so you should not aim for 100% accuracy. The values in this column are crypto, nano, pico, synecho, and ultra

Step 1: Read and summarize the data

data <- read.csv("seaflow_21min.csv")
summary(data)
##     file_id         time        cell_id            d1       
##  Min.   :203   Min.   : 12   Min.   :    0   Min.   : 1328  
##  1st Qu.:204   1st Qu.:174   1st Qu.: 7486   1st Qu.: 7296  
##  Median :206   Median :362   Median :14995   Median :17728  
##  Mean   :206   Mean   :342   Mean   :15008   Mean   :17039  
##  3rd Qu.:208   3rd Qu.:503   3rd Qu.:22401   3rd Qu.:24512  
##  Max.   :209   Max.   :643   Max.   :32081   Max.   :54048  
##        d2          fsc_small        fsc_perp        fsc_big     
##  Min.   :   32   Min.   :10005   Min.   :    0   Min.   :32384  
##  1st Qu.: 9584   1st Qu.:31341   1st Qu.:13496   1st Qu.:32400  
##  Median :18512   Median :35483   Median :18069   Median :32400  
##  Mean   :17437   Mean   :34919   Mean   :17646   Mean   :32405  
##  3rd Qu.:24656   3rd Qu.:39184   3rd Qu.:22243   3rd Qu.:32416  
##  Max.   :54688   Max.   :65424   Max.   :63456   Max.   :32464  
##        pe          chl_small        chl_big           pop       
##  Min.   :    0   Min.   : 3485   Min.   :    0   crypto :  102  
##  1st Qu.: 1635   1st Qu.:22525   1st Qu.: 2800   nano   :12698  
##  Median : 2421   Median :30512   Median : 7744   pico   :20860  
##  Mean   : 5325   Mean   :30164   Mean   : 8328   synecho:18146  
##  3rd Qu.: 5854   3rd Qu.:38299   3rd Qu.:12880   ultra  :20537  
##  Max.   :58675   Max.   :64832   Max.   :57184
table(data$file_id)
## 
##   203   204   205   206   207   208   209 
## 10454  9435  8376  9215 11444 11995 11424
hist(data$time)

plot of chunk unnamed-chunk-1

plot(data$file_id,data$cell_id)

plot of chunk unnamed-chunk-1

plot(data$d1,data$d2)

plot of chunk unnamed-chunk-1

plot(data$fsc_small,data$fsc_perp)

plot of chunk unnamed-chunk-1

hist(data$fsc_big)

plot of chunk unnamed-chunk-1

hist(data$pe)

plot of chunk unnamed-chunk-1

plot(data$chl_small,data$chl_big)

plot of chunk unnamed-chunk-1

table(data$pop)
## 
##  crypto    nano    pico synecho   ultra 
##     102   12698   20860   18146   20537

Step 2: Split the data into test and train

library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
ind <- createDataPartition(y=data$pop, p=0.5, list=F)
mytrain <- data[ind,]
mytest  <- data[-ind,]

mean(mytrain$time)
## [1] 340.9

step 3: Plot the data

 library(ggplot2)
#qplot(x=chl_small, y= pe, data=data, fill=as.factor(pop))
ggplot(data, aes(chl_small, pe, color=pop))+geom_line()

plot of chunk plotdata

Step 4: Train a decision tree (rpart)

f1 <- formula(pop~fsc_small + fsc_perp + fsc_big + pe + chl_big + chl_small)
#model <- train(f1, data)
library(rpart)
model <- rpart(f1, method="class", data=mytrain)
print(model)
## n= 36172 
## 
## node), split, n, loss, yval, (yprob)
##       * denotes terminal node
## 
##  1) root 36172 25740 pico (0.0014 0.18 0.29 0.25 0.28)  
##    2) pe< 5006 26334 15950 pico (0 0.22 0.39 3.8e-05 0.38)  
##      4) chl_small< 3.265e+04 11962  2296 pico (0 0.0005 0.81 8.4e-05 0.19) *
##      5) chl_small>=3.265e+04 14372  6532 ultra (0 0.4 0.05 0 0.55)  
##       10) chl_small>=4.129e+04 5186   669 nano (0 0.87 0 0 0.13) *
##       11) chl_small< 4.129e+04 9186  2015 ultra (0 0.14 0.078 0 0.78) *
##    3) pe>=5006 9838   766 synecho (0.0052 0.054 0.0045 0.92 0.014)  
##      6) chl_small>=3.803e+04 655   138 nano (0.078 0.79 0 0.069 0.064) *
##      7) chl_small< 3.803e+04 9183   156 synecho (0 0.0015 0.0048 0.98 0.011) *

Step 5: Evaluate the decision tree on the test data

pred <- predict(model, newdata=mytest,type="class")
table(mytest$pop, pred)
##          pred
##           crypto nano pico synecho ultra
##   crypto       0   51    0       0     0
##   nano         0 5044    3       8  1294
##   pico         0    1 9664      48   717
##   synecho      0   39    4    9030     0
##   ultra        0  713 2263     101  7191
mean(mytest$pop==pred)
## [1] 0.8551

Step 6: Build and Evaluate a random forest

Random forests can automatically generate an estimate of variable importance during training by permuting the values in a given variable and measuring the effect on classification. If scrambling the values has little effect on the model's ability to make predictions, then the variable must not be very important.

A random forest can obtain another estimate of variable importance based on the Gini impurity that we discussed in the lecture. The function importance(model) prints the mean decrease in gini importance for each variable. The higher the number, the more the gini impurity score decreases by branching on this variable, indicating that the variable is more important.

library(randomForest)
## randomForest 4.6-10
## Type rfNews() to see new features/changes/bug fixes.
model <- randomForest(f1, data=mytrain)
pred <- predict(model, newdata=mytest,type="class")
table(mytest$pop, pred)
##          pred
##           crypto  nano  pico synecho ultra
##   crypto      51     0     0       0     0
##   nano         0  5565     0       4   780
##   pico         0     0 10112       9   309
##   synecho      0     4     0    9068     1
##   ultra        0   378  1345       7  8538
mean(mytest$pop==pred)
## [1] 0.9216
importance(model)
##           MeanDecreaseGini
## fsc_small           2725.2
## fsc_perp            2095.3
## fsc_big              205.9
## pe                  8900.5
## chl_big             4776.7
## chl_small           8151.1

Step 7: Train a support vector machine model and compare results.

Use the e1071 library and call model <- svm(fol, data=trainingdata).

Answer Question 11.

library(e1071)
model <- svm(f1, data=mytrain)
pred <- predict(model, newdata=mytest,type="class")
table(mytest$pop, pred)
##          pred
##           crypto  nano  pico synecho ultra
##   crypto      47     1     0       3     0
##   nano         1  5599     0       4   745
##   pico         0     0 10062      60   308
##   synecho      0     2    23    9048     0
##   ultra        0   396  1317       9  8546
mean(mytest$pop==pred)
## [1] 0.9207

Step 8: Construct confusion matrices

Use the table function to generate a confusion matrix for each of your three methods. Generate predictions using the predict function, then call the table functions like this:

table(pred = predictions, true = testingdata$pop)
## Error: object 'predictions' not found

Answer Question 12.

Step 9: Sanity check the data

As a data scientist, you should never trust the data, especially if you did not collect it yourself. There is no such thing as clean data. You should always be trying to prove your results wrong by finding problems with the data. Richard Feynman calls it “bending over backwards to show how you're maybe wrong.” This is even more critical in data science, because almost by definition you are using someone else's data that was collected for some other purpose rather than the experiment you want to do. So of course it's going to have problems.

The measurements in this dataset are all supposed to be continuous (fsc_small, fsc_perp, fsc_big, pe, chl_small, chl_big), but one is not. Using plots or R code, figure out which field is corrupted.

 hist(data$fsc_small)

plot of chunk step9

 hist(data$fsc_perp)

plot of chunk step9

 hist(data$fsc_big)

plot of chunk step9

 hist(data$pe)

plot of chunk step9

 hist(data$chl_small)

plot of chunk step9

 hist(data$chl_big)

plot of chunk step9

 hist(data$fsc_big)

plot of chunk step9

There is more subtle issue with data as well. Plot time vs. chl_big, and you will notice a band of the data looks out of place. This band corresponds to data from a particular file for which the sensor may have been miscalibrated. Remove this data from the dataset by filtering out all data associated with file_id 208, then repeat the experiment for all three methods, making sure to split the dataset into training and test sets after filtering out the bad data.

data2=data[data$file_id!=208,]
ind <- createDataPartition(y=data1$pop, p=0.5, list=F)
## Error: object 'data1' not found
mytrain2 <- data2[ind,]
mytest2  <- data2[-ind,]
pred2 <- predict(model, newdata=mytest2,type="class")

mean(mytest2$pop==pred2)
## [1] 0.9669
mean(mytest$pop==pred)
## [1] 0.9207