This is a project of Introduction to Data Science (coursera), working with data from the SeaFlow environmental flow cytometry instrument.
A flow cytometer delivers a flow of particles through capilliary. By shining lasers of different wavelengths and measuring the absorption and refraction patterns, you can determine how large the particle is and some information about its color and other properties, allowing you to detect it.
The technology was developed for medical applciations, where the particles were potential pathogens in, say, serum, and the goal was to give a diagnosis. But the technology was adapted for use in environmental science to understand microbial population profiles.
The SeaFlow instrument, developed by the Armbrust Lab at the University of Washington, is unique in that it is deployed on research vessels and takes continuous measurements of population profiles in the open ocean.
Here the dataset represents a 21 minute sample from the vessel in a file seaflow_21min.csv. This sample has been pre-processed to remove the calibration “beads” that are passed through the system for monitoring, as well as some other particle types.
The columns of this dataset are as follows:
file_id, time, cell_id, d1, d2, fsc_small, fsc_perp, fsc_big, pe, chl_small, chl_big, pop
file_id: The data arrives in files, where each file represents a three-minute window; this field represents which file the data came from. The number is ordered by time, but is otherwise not significant.
time: This is an integer representing the time the particle passed through the instrument. Many particles may arrive at the same time; time is not a key for this relation.
cell_id: A unique identifier for each cell WITHIN a file. (file_id, cell_id) is a key for this relation.
d1, d2: Intensity of light at the two main sensors, oriented perpendicularly. These sensors are primarily used to determine whether the particles are properly centered in the stream. Used primarily in preprocesssing; they are unlikely to be useful for classification.
fsc_small, fsc_perp, fsc_big: Forward scatter small, perpendicular, and big. These values help distingish different sizes of particles.
pe: A measurement of phycoerythrin fluorescence, which is related to the wavelength associated with an orange color in microorganisms
chl_small, chl_big: Measurements related to the wavelength of light corresponding to chlorophyll.
pop: This is the class label assigned by the clustering mechanism used in the production system. It can be considered “ground truth” for the purposes of the assignment, but note that there are particles that cannot be unambiguously classified, so you should not aim for 100% accuracy. The values in this column are crypto, nano, pico, synecho, and ultra
data <- read.csv("seaflow_21min.csv")
summary(data)
## file_id time cell_id d1
## Min. :203 Min. : 12 Min. : 0 Min. : 1328
## 1st Qu.:204 1st Qu.:174 1st Qu.: 7486 1st Qu.: 7296
## Median :206 Median :362 Median :14995 Median :17728
## Mean :206 Mean :342 Mean :15008 Mean :17039
## 3rd Qu.:208 3rd Qu.:503 3rd Qu.:22401 3rd Qu.:24512
## Max. :209 Max. :643 Max. :32081 Max. :54048
## d2 fsc_small fsc_perp fsc_big
## Min. : 32 Min. :10005 Min. : 0 Min. :32384
## 1st Qu.: 9584 1st Qu.:31341 1st Qu.:13496 1st Qu.:32400
## Median :18512 Median :35483 Median :18069 Median :32400
## Mean :17437 Mean :34919 Mean :17646 Mean :32405
## 3rd Qu.:24656 3rd Qu.:39184 3rd Qu.:22243 3rd Qu.:32416
## Max. :54688 Max. :65424 Max. :63456 Max. :32464
## pe chl_small chl_big pop
## Min. : 0 Min. : 3485 Min. : 0 crypto : 102
## 1st Qu.: 1635 1st Qu.:22525 1st Qu.: 2800 nano :12698
## Median : 2421 Median :30512 Median : 7744 pico :20860
## Mean : 5325 Mean :30164 Mean : 8328 synecho:18146
## 3rd Qu.: 5854 3rd Qu.:38299 3rd Qu.:12880 ultra :20537
## Max. :58675 Max. :64832 Max. :57184
table(data$file_id)
##
## 203 204 205 206 207 208 209
## 10454 9435 8376 9215 11444 11995 11424
hist(data$time)
plot(data$file_id,data$cell_id)
plot(data$d1,data$d2)
plot(data$fsc_small,data$fsc_perp)
hist(data$fsc_big)
hist(data$pe)
plot(data$chl_small,data$chl_big)
table(data$pop)
##
## crypto nano pico synecho ultra
## 102 12698 20860 18146 20537
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
ind <- createDataPartition(y=data$pop, p=0.5, list=F)
mytrain <- data[ind,]
mytest <- data[-ind,]
mean(mytrain$time)
## [1] 340.9
library(ggplot2)
#qplot(x=chl_small, y= pe, data=data, fill=as.factor(pop))
ggplot(data, aes(chl_small, pe, color=pop))+geom_line()
f1 <- formula(pop~fsc_small + fsc_perp + fsc_big + pe + chl_big + chl_small)
#model <- train(f1, data)
library(rpart)
model <- rpart(f1, method="class", data=mytrain)
print(model)
## n= 36172
##
## node), split, n, loss, yval, (yprob)
## * denotes terminal node
##
## 1) root 36172 25740 pico (0.0014 0.18 0.29 0.25 0.28)
## 2) pe< 5006 26334 15950 pico (0 0.22 0.39 3.8e-05 0.38)
## 4) chl_small< 3.265e+04 11962 2296 pico (0 0.0005 0.81 8.4e-05 0.19) *
## 5) chl_small>=3.265e+04 14372 6532 ultra (0 0.4 0.05 0 0.55)
## 10) chl_small>=4.129e+04 5186 669 nano (0 0.87 0 0 0.13) *
## 11) chl_small< 4.129e+04 9186 2015 ultra (0 0.14 0.078 0 0.78) *
## 3) pe>=5006 9838 766 synecho (0.0052 0.054 0.0045 0.92 0.014)
## 6) chl_small>=3.803e+04 655 138 nano (0.078 0.79 0 0.069 0.064) *
## 7) chl_small< 3.803e+04 9183 156 synecho (0 0.0015 0.0048 0.98 0.011) *
pred <- predict(model, newdata=mytest,type="class")
table(mytest$pop, pred)
## pred
## crypto nano pico synecho ultra
## crypto 0 51 0 0 0
## nano 0 5044 3 8 1294
## pico 0 1 9664 48 717
## synecho 0 39 4 9030 0
## ultra 0 713 2263 101 7191
mean(mytest$pop==pred)
## [1] 0.8551
Random forests can automatically generate an estimate of variable importance during training by permuting the values in a given variable and measuring the effect on classification. If scrambling the values has little effect on the model's ability to make predictions, then the variable must not be very important.
A random forest can obtain another estimate of variable importance based on the Gini impurity that we discussed in the lecture. The function importance(model) prints the mean decrease in gini importance for each variable. The higher the number, the more the gini impurity score decreases by branching on this variable, indicating that the variable is more important.
library(randomForest)
## randomForest 4.6-10
## Type rfNews() to see new features/changes/bug fixes.
model <- randomForest(f1, data=mytrain)
pred <- predict(model, newdata=mytest,type="class")
table(mytest$pop, pred)
## pred
## crypto nano pico synecho ultra
## crypto 51 0 0 0 0
## nano 0 5565 0 4 780
## pico 0 0 10112 9 309
## synecho 0 4 0 9068 1
## ultra 0 378 1345 7 8538
mean(mytest$pop==pred)
## [1] 0.9216
importance(model)
## MeanDecreaseGini
## fsc_small 2725.2
## fsc_perp 2095.3
## fsc_big 205.9
## pe 8900.5
## chl_big 4776.7
## chl_small 8151.1
Use the e1071 library and call model <- svm(fol, data=trainingdata).
Answer Question 11.
library(e1071)
model <- svm(f1, data=mytrain)
pred <- predict(model, newdata=mytest,type="class")
table(mytest$pop, pred)
## pred
## crypto nano pico synecho ultra
## crypto 47 1 0 3 0
## nano 1 5599 0 4 745
## pico 0 0 10062 60 308
## synecho 0 2 23 9048 0
## ultra 0 396 1317 9 8546
mean(mytest$pop==pred)
## [1] 0.9207
Use the table function to generate a confusion matrix for each of your three methods. Generate predictions using the predict function, then call the table functions like this:
table(pred = predictions, true = testingdata$pop)
## Error: object 'predictions' not found
Answer Question 12.
As a data scientist, you should never trust the data, especially if you did not collect it yourself. There is no such thing as clean data. You should always be trying to prove your results wrong by finding problems with the data. Richard Feynman calls it “bending over backwards to show how you're maybe wrong.” This is even more critical in data science, because almost by definition you are using someone else's data that was collected for some other purpose rather than the experiment you want to do. So of course it's going to have problems.
The measurements in this dataset are all supposed to be continuous (fsc_small, fsc_perp, fsc_big, pe, chl_small, chl_big), but one is not. Using plots or R code, figure out which field is corrupted.
hist(data$fsc_small)
hist(data$fsc_perp)
hist(data$fsc_big)
hist(data$pe)
hist(data$chl_small)
hist(data$chl_big)
hist(data$fsc_big)
There is more subtle issue with data as well. Plot time vs. chl_big, and you will notice a band of the data looks out of place. This band corresponds to data from a particular file for which the sensor may have been miscalibrated. Remove this data from the dataset by filtering out all data associated with file_id 208, then repeat the experiment for all three methods, making sure to split the dataset into training and test sets after filtering out the bad data.
data2=data[data$file_id!=208,]
ind <- createDataPartition(y=data1$pop, p=0.5, list=F)
## Error: object 'data1' not found
mytrain2 <- data2[ind,]
mytest2 <- data2[-ind,]
pred2 <- predict(model, newdata=mytest2,type="class")
mean(mytest2$pop==pred2)
## [1] 0.9669
mean(mytest$pop==pred)
## [1] 0.9207