In this assignment, you will be working with data from the SeaFlow environmental flow cytometry instrument.
A flow cytometer delivers a flow of particles through capilliary. By shining lasers of different wavelengths and measuring the absorption and refraction patterns, you can determine how large the particle is and some information about its color and other properties, allowing you to detect it.
The technology was developed for medical applciations, where the particles were potential pathogens in, say, serum, and the goal was to give a diagnosis. But the technology was adapted for use in environmental science to understand microbial population profiles.
The SeaFlow instrument, developed by the Armbrust Lab at the University of Washington, is unique in that it is deployed on research vessels and takes continuous measurements of population profiles in the open ocean.
The scale of the data can be quite large, and is expected to grow significantly: A two-week cruise from one vessel can generate hundreds of gigabytes per day, and the vision is to deploy one of these instruments on not only research vessels but the commercial shipping fleet as well.
While there are a number of challenging analytics tasks associated with this data, a central task is classification of particles. Based on the optical measurements of the particle, it can be identified as one of several populations.
Using R, read the file seaflow_21min.csv and get the overall counts for each category of particle.
library("knitr")
data <- read.csv("https://raw.githubusercontent.com/uwescience/datasci_course_materials/master/assignment5/seaflow_21min.csv")
summary(data$fsc_small)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 10005 31341 35483 34919 39184 65424
table(data$pop)
##
## crypto nano pico synecho ultra
## 102 12698 20860 18146 20537
file_id: The data arrives in files, where each file represents a three-minute window; this field represents which file the data came from. The number is ordered by time, but is otherwise not significant.
time: This is an integer representing the time the particle passed through the instrument. Many particles may arrive at the same time; time is not a key for this relation.
cell_id: A unique identifier for each cell WITHIN a file. (file_id, cell_id) is a key for this relation.
d1, d2: Intensity of light at the two main sensors, oriented perpendicularly. These sensors are primarily used to determine whether the particles are properly centered in the stream. Used primarily in preprocesssing; they are unlikely to be useful for classification.
fsc_small, fsc_perp, fsc_big: Forward scatter small, perpendicular, and big. These values help distingish different sizes of particles.
pe: A measurement of phycoerythrin fluorescence, which is related to the wavelength associated with an orange color in microorganisms
chl_small, chl_big: Measurements related to the wavelength of light corresponding to chlorophyll.
#install.packages("caret")
library(caret)
set.seed(3000)
trainIndex <- createDataPartition(data$pop, p = .5,
list = FALSE)
Train <- data[ trainIndex,]
Test <- data[-trainIndex,]
summary(Train)
## file_id time cell_id d1
## Min. :203.0 Min. : 12.0 Min. : 0 Min. : 1328
## 1st Qu.:204.0 1st Qu.:174.0 1st Qu.: 7456 1st Qu.: 7312
## Median :206.0 Median :362.0 Median :14920 Median :17648
## Mean :206.2 Mean :341.4 Mean :14974 Mean :17028
## 3rd Qu.:208.0 3rd Qu.:503.0 3rd Qu.:22370 3rd Qu.:24496
## Max. :209.0 Max. :643.0 Max. :32078 Max. :54048
## d2 fsc_small fsc_perp fsc_big
## Min. : 32 Min. :10016 Min. : 0 Min. :32384
## 1st Qu.: 9584 1st Qu.:31291 1st Qu.:13448 1st Qu.:32400
## Median :18432 Median :35456 Median :18019 Median :32400
## Mean :17409 Mean :34899 Mean :17631 Mean :32405
## 3rd Qu.:24640 3rd Qu.:39200 3rd Qu.:22262 3rd Qu.:32416
## Max. :54096 Max. :65424 Max. :62173 Max. :32464
## pe chl_small chl_big pop
## Min. : 0 Min. : 3675 Min. : 0 crypto : 51
## 1st Qu.: 1635 1st Qu.:22565 1st Qu.: 2784 nano : 6349
## Median : 2408 Median :30519 Median : 7744 pico :10430
## Mean : 5328 Mean :30182 Mean : 8331 synecho: 9073
## 3rd Qu.: 5885 3rd Qu.:38305 3rd Qu.:12832 ultra :10269
## Max. :58627 Max. :64805 Max. :52608
split the data into 20/80
library(ggplot2)
ggplot(Train, aes(x=pe,y=chl_small,color=pop)) + geom_point()
ggplot(Test, aes(x=pe,y=chl_small,color=pop)) + geom_point()
head(Train)
## file_id time cell_id d1 d2 fsc_small fsc_perp fsc_big pe
## 1 203 12 1 25344 27968 34677 14944 32400 2216
## 4 203 12 9 7712 14528 28744 10219 32416 1248
## 5 203 12 11 30368 21440 28861 6101 32400 12989
## 6 203 12 15 30032 22704 31221 13488 32400 1883
## 8 203 12 23 17008 7072 38987 20315 32416 1509
## 10 203 12 26 31312 22608 34133 21829 32400 2525
## chl_small chl_big pop
## 1 28237 5072 pico
## 4 35392 10704 ultra
## 5 23421 5920 synecho
## 6 27323 6560 pico
## 8 36680 15072 ultra
## 10 28205 2336 pico
library(rpart)
fol <- formula(pop ~ fsc_small + fsc_perp + fsc_big + pe + chl_small + chl_big)
dt_model <- rpart(fol, method="class", data=Train)
print(dt_model)
## n= 36172
##
## node), split, n, loss, yval, (yprob)
## * denotes terminal node
##
## 1) root 36172 25742 pico (0.0014 0.18 0.29 0.25 0.28)
## 2) pe< 5001.5 26343 15958 pico (0 0.22 0.39 0 0.38)
## 4) chl_small< 32596 11828 2222 pico (0 0.00034 0.81 0 0.19) *
## 5) chl_small>=32596 14515 6595 ultra (0 0.4 0.054 0 0.55)
## 10) chl_small>=41292 5192 658 nano (0 0.87 0.00019 0 0.13) *
## 11) chl_small< 41292 9323 2060 ultra (0 0.14 0.083 0 0.78) *
## 3) pe>=5001.5 9829 756 synecho (0.0052 0.054 0.0046 0.92 0.013)
## 6) chl_small>=38113.5 655 134 nano (0.078 0.8 0 0.063 0.064) *
## 7) chl_small< 38113.5 9174 142 synecho (0 0.00087 0.0049 0.98 0.0097) *
dt_predict <- predict(dt_model, newdata=Test, type="class")
dt_result = dt_predict == Test$pop
summary(dt_result)
## Mode FALSE TRUE
## logical 5283 30888
#0.856
For example, in this bogus tree, a particle with chl_small=25000 and pe=2000 would take branch 2, branch 4, and be classified as pico. Crypto population is not recognizing. chl_small is the most important vairable in predicting the class population.
library(randomForest)
rf_model <- randomForest(fol, data=Train)
rf_predict = predict(rf_model, newdata=Test)
rf_result = rf_predict == Test$pop
summary(rf_result)
## Mode FALSE TRUE
## logical 2882 33289
importance(rf_model)
## MeanDecreaseGini
## fsc_small 2712.2526
## fsc_perp 2107.5250
## fsc_big 201.5438
## pe 9003.0716
## chl_small 8002.4482
## chl_big 4827.8646
#0.9207
Appear most important in terms of gini impurity measure are pe and chl_small
library(e1071)
svm_model = svm(fol, data=Train)
svm_predict = predict(svm_model, newdata=Test)
svm_result = svm_predict == Test$pop
summary(svm_result)
## Mode FALSE TRUE
## logical 2883 33288
accuracy is 0.9194.
table(pred = dt_predict, true = Test$pop) # Decision tree
## true
## pred crypto nano pico synecho ultra
## crypto 0 0 0 0 0
## nano 51 5017 0 41 720
## pico 0 3 9663 0 2258
## synecho 0 16 47 9032 114
## ultra 0 1313 720 0 7176
table(pred = rf_predict, true = Test$pop) # Random Forest
## true
## pred crypto nano pico synecho ultra
## crypto 51 0 0 0 0
## nano 0 5551 0 1 346
## pico 0 0 10044 0 1341
## synecho 0 2 10 9070 8
## ultra 0 796 376 2 8573
table(pred = svm_predict, true = Test$pop) # Support Vector Machine
## true
## pred crypto nano pico synecho ultra
## crypto 48 0 0 1 0
## nano 1 5627 0 2 395
## pico 0 0 10037 32 1322
## synecho 2 2 62 9031 6
## ultra 0 720 331 7 8545
plot(data$chl_big, data$chl_small)
plot(data$fsc_big, data$fsc_small)
plot(data$fsc_perp, data$pe)
Ultra and Pico appears to be the most common error. We assumed variables were continuous, but fsc_big has lot of clustering.
newfol <- formula(pop ~ fsc_small + fsc_perp + pe + chl_big + chl_small)
data2 = subset(data, data$file_id != 208)
# Resample.
set.seed(3000)
trainIndex <- createDataPartition(data2$pop, p = .5,
list = FALSE)
Train2 <- data2[ trainIndex,]
Test2 <- data2[-trainIndex,]
library("e1071")
new_svm_model = svm(newfol, data=Train2)
new_svm_predict = predict(new_svm_model, newdata=Test2)
new_svm_result = new_svm_predict == Test2$pop
summary(new_svm_result)
## Mode FALSE TRUE
## logical 825 29348
Difference in accuracy 0.9725 - 0.9194 = 0.053