# install.packages("caret")
# install.packages("rpart")
# install.packages("tree")
# install.packages("randomForest")
# install.packages("e1071")
# install.packages("ggplot2")
You are provided a dataset that represents a 21-minute sample from the vessel, in the file seaflow_21min.csv. This sample has been pre-processed to remove the calibration "beads" that are passed through the system for monitoring, as well as some other particle types.
The columns of this dataset are as follows:
file_id, time, cell_id, d1, d2, fsc_small, fsc_perp, fsc_big, pe, chl_small, chl_big, pop
The goal of this assignment is to use R to experiment with a real dataset – including its idiosyncrasies – and to consider the strengths and weaknesses of a few popular methods for supervised learning.
Using R, read the file seaflow_21min.csv and get the overall counts for each category of particle. You may consider using the functions read.csv and summary.
setwd("E:\\Dropbox\\Github\\introdata\\datasci_course_materials\\assignment5")
myData <- read.csv("seaflow_21min.csv")
summary(myData)
## file_id time cell_id d1
## Min. :203.0 Min. : 12.0 Min. : 0 Min. : 1328
## 1st Qu.:204.0 1st Qu.:174.0 1st Qu.: 7486 1st Qu.: 7296
## Median :206.0 Median :362.0 Median :14995 Median :17728
## Mean :206.2 Mean :341.5 Mean :15008 Mean :17039
## 3rd Qu.:208.0 3rd Qu.:503.0 3rd Qu.:22401 3rd Qu.:24512
## Max. :209.0 Max. :643.0 Max. :32081 Max. :54048
## d2 fsc_small fsc_perp fsc_big
## Min. : 32 Min. :10005 Min. : 0 Min. :32384
## 1st Qu.: 9584 1st Qu.:31341 1st Qu.:13496 1st Qu.:32400
## Median :18512 Median :35483 Median :18069 Median :32400
## Mean :17437 Mean :34919 Mean :17646 Mean :32405
## 3rd Qu.:24656 3rd Qu.:39184 3rd Qu.:22243 3rd Qu.:32416
## Max. :54688 Max. :65424 Max. :63456 Max. :32464
## pe chl_small chl_big pop
## Min. : 0 Min. : 3485 Min. : 0 crypto : 102
## 1st Qu.: 1635 1st Qu.:22525 1st Qu.: 2800 nano :12698
## Median : 2421 Median :30512 Median : 7744 pico :20860
## Mean : 5325 Mean :30164 Mean : 8328 synecho:18146
## 3rd Qu.: 5854 3rd Qu.:38299 3rd Qu.:12880 ultra :20537
## Max. :58675 Max. :64832 Max. :57184
Question 1 How many particles labeled “synecho” are in the file provided?
18146
Question 2 What is the 3rd Quartile of the field fsc_small? (the summary function computes this on your behalf)
39184
Divide the data into two equal subsets, one for training and one for testing. Make sure to divide the data in an unbiased manner.
You might consider using either the createDataPartition function or the sample function, although there are many ways to achieve the goal.
Answer Question 3.
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
set.seed(3456)
inTrainingSet <- createDataPartition(myData$pop,p = 0.5,list=FALSE)
training <- myData[inTrainingSet,]
testing <- myData[-inTrainingSet,]
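For reference, the base-R sample function mentioned above can also split the data; a minimal sketch (training_alt and testing_alt are illustrative names, and unlike createDataPartition this does not stratify by pop):
idx <- sample(nrow(myData), size = floor(nrow(myData) / 2))
training_alt <- myData[idx, ]   # roughly half the rows, chosen at random
testing_alt  <- myData[-idx, ]  # the remaining rows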
# the partitioned training set is named 'training', not 'seaflowTrain'
summary(training)
mean(training$time)
Question 3 What is the mean of the variable “time” for your training set?
341.6
Plot pe against chl_small and color by pop
I recommend using the function ggplot in the library ggplot2 rather than using base R functions, but this is not required.
Answer Question 4.
Question 4 In the plot of pe vs. chl_small, the particles labeled ultra should appear to be somewhat “mixed” with two other populations of particles. Which two populations?
library(ggplot2)
p <- ggplot(myData, aes(x = chl_small, y = pe))
p + geom_point(aes(colour = pop))  # geom_point, not geom_line: these are individual particles, not a series
ggplot(aes(x=chl_small,y=pe,color=pop),data=myData)+geom_jitter()  # jitter reduces overplotting
pico and nano
Install the rpart library if you do not have it, and load the library.
Many statistical models in R provide an interface of the form
model <- train(formula, dataframe)
You can then use the model object to make predictions by passing it to the predict function.
A formula object describes the relationship between the independent variables and the response variable, and can be expressed with the syntax
response ~ var1 + var2 + var3
and used with the formula function to construct the formula object itself:
fol <- formula(response ~ var1 + var2 + var3)
The rpart library uses this convention. Assuming your training data is in a data frame called training and you have constructed a formula object called fol, you can construct a decision tree using the following syntax (included here to avoid you struggling with a couple of unusual aspects of R):
model <- rpart(fol, method="class", data=training)
Train a tree as a function of the sensor measurements: fsc_small + fsc_perp + fsc_big + pe + chl_big + chl_small
Print the model object with print(model).
library(rpart)
fol <- formula(pop ~ fsc_small + fsc_perp + fsc_big + pe + chl_big + chl_small)
model_1 <- rpart(fol, method="class", data=training)
print(model_1)
# partykit enhances the plotting functions for recursive partitioning.
# install.packages("partykit")
library(partykit)
## Loading required package: grid
# Convert the rpart object to class 'party' and plot it to see more detail in the terminal nodes.
rpart1a <- as.party(model_1)
plot(rpart1a)
library(rpart.plot)
rpart.plot(model_1,branch=0,branch.type=2,type=1,extra=102,shadow.col="pink",box.col="gray",split.col="magenta")
The output is a set of decision nodes, one node per line, with indentation indicating depth in the tree.
To make a prediction, walk down the tree applying the predicates to determine which branch to take. For example, in the toy tree shown in the assignment prompt, a particle with chl_small=25000 and fsc_perp=1000 would take branch 2, branch 4, then branch 10, and be classified as nano.
Answer Questions 5, 6, 7.
Question 5 Use print(model) to inspect your tree. Which populations, if any, is your tree incapable of recognizing? (Which populations do not appear on any branch?) (It's possible, but very unlikely, that an incorrect answer to this question is the result of improbable sampling.) Hint: Look at which class labels appear in the printed tree.
crypto
Question 6 Most trees will include a node near the root that applies a rule to the pe field, where particles with a value less than some threshold will descend down one branch, and particles with a value greater than some threshold will descend down a different branch. If you look at the plot you created previously, you can verify that the threshold used in the tree is evident visually. What is the value of the threshold on the pe field learned in your model?
5001.5
Question 7 Based on your decision tree, which variables appear to be most important in predicting the class population?
pe and chl_small
Use the predict function to generate predictions on your test data. Then, compare these predictions with the class labels in the test data itself.
In R, if you write A==B where A and B are vectors, the result is a logical vector of TRUE/FALSE values. sum() treats TRUE as 1 and FALSE as 0, so the sum of this vector is the number of correct predictions, and dividing by the size of the test dataset gives the accuracy.
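Equivalently, mean() of a logical vector is the fraction of TRUEs, so the accuracy fits on one line; a sketch, where predictions stands in for whatever vector of predicted labels you build below:
mean(predictions == testing$pop)  # fraction of correct predictions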
Answer Question 8.
predict_1=predict(model_1,newdata=testing)  # default for rpart: a matrix of class probabilities
str(predict_1)
## num [1:36171, 1:5] 0 0 0 0 0 0 0 0 0 0 ...
## - attr(*, "dimnames")=List of 2
## ..$ : chr [1:36171] "1" "2" "4" "7" ...
## ..$ : chr [1:5] "crypto" "nano" "pico" "synecho" ...
# pick the most probable class for each row of the probability matrix
pop_test_1=c()
pop_names=c("crypto","nano","pico","synecho","ultra")
for (i in 1:nrow(predict_1)){
  pop_test_1=c(pop_test_1,pop_names[which.max(predict_1[i,])])
}
result_1=as.vector(testing$pop)==pop_test_1
table(result_1)
## result_1
## FALSE TRUE
## 5196 30975
accuracy_1=sum(result_1)/length(pop_test_1)  # 30975 / 36171 ≈ 0.8563
# simpler: ask rpart for the class labels directly
predict_1c=predict(model_1,newdata=testing,type="class")
table(predict_1c, testing$pop)
##
## predict_1c crypto nano pico synecho ultra
## crypto 0 0 0 0 0
## nano 51 4990 1 35 663
## pico 0 2 9368 3 1919
## synecho 0 20 47 9035 104
## ultra 0 1337 1014 0 7582
confusionMatrix(predict_1c, testing$pop)
## Confusion Matrix and Statistics
##
## Reference
## Prediction crypto nano pico synecho ultra
## crypto 0 0 0 0 0
## nano 51 4990 1 35 663
## pico 0 2 9368 3 1919
## synecho 0 20 47 9035 104
## ultra 0 1337 1014 0 7582
##
## Overall Statistics
##
## Accuracy : 0.8563
## 95% CI : (0.8527, 0.8599)
## No Information Rate : 0.2884
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.806
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: crypto Class: nano Class: pico Class: synecho
## Sensitivity 0.00000 0.7860 0.8982 0.9958
## Specificity 1.00000 0.9749 0.9253 0.9937
## Pos Pred Value NaN 0.8693 0.8296 0.9814
## Neg Pred Value 0.99859 0.9553 0.9573 0.9986
## Prevalence 0.00141 0.1755 0.2884 0.2508
## Detection Rate 0.00000 0.1380 0.2590 0.2498
## Detection Prevalence 0.00000 0.1587 0.3122 0.2545
## Balanced Accuracy 0.50000 0.8804 0.9117 0.9948
## Class: ultra
## Sensitivity 0.7384
## Specificity 0.9092
## Pos Pred Value 0.7633
## Neg Pred Value 0.8976
## Prevalence 0.2839
## Detection Rate 0.2096
## Detection Prevalence 0.2746
## Balanced Accuracy 0.8238
Question 8 What was the accuracy of your decision tree on the test data? Enter a number between 0 and 1.
0.8563
Load the randomForest library, then call randomForest using the formula object and the data, as you did to build a single tree:
install.packages("randomForest")
## Installing package into 'C:/Users/snowdj/Documents/R/win-library/3.0'
## (as 'lib' is unspecified)
## Error: trying to use CRAN without setting a mirror
library(randomForest)
## randomForest 4.6-10
## Type rfNews() to see new features/changes/bug fixes.
model_6 <- randomForest(fol, data=training)
Evaluate this model on the test data the same way you did for the tree.
Answer Question 9.
Question 9 What was the accuracy of your random forest model on the test data? Enter a number between 0 and 1.
# (the manual which.max loop from Question 8 would work here as well)
# simpler: request the class labels directly (for randomForest, type="class" is treated like the
# default type="response", which already returns class labels)
predict_9c=predict(model_6,newdata=testing,type="class")
table(predict_9c, testing$pop)
##
## predict_9c crypto nano pico synecho ultra
## crypto 51 2 0 1 0
## nano 0 5541 0 4 357
## pico 0 0 10060 0 1365
## synecho 0 2 10 9068 10
## ultra 0 804 360 0 8536
confusionMatrix(predict_9c, testing$pop)
## Confusion Matrix and Statistics
##
## Reference
## Prediction crypto nano pico synecho ultra
## crypto 51 2 0 1 0
## nano 0 5541 0 4 357
## pico 0 0 10060 0 1365
## synecho 0 2 10 9068 10
## ultra 0 804 360 0 8536
##
## Overall Statistics
##
## Accuracy : 0.9194
## 95% CI : (0.9166, 0.9222)
## No Information Rate : 0.2884
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.8913
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: crypto Class: nano Class: pico Class: synecho
## Sensitivity 1.000000 0.8727 0.9645 0.9994
## Specificity 0.999917 0.9879 0.9470 0.9992
## Pos Pred Value 0.944444 0.9388 0.8805 0.9976
## Neg Pred Value 1.000000 0.9733 0.9850 0.9998
## Prevalence 0.001410 0.1755 0.2884 0.2508
## Detection Rate 0.001410 0.1532 0.2781 0.2507
## Detection Prevalence 0.001493 0.1632 0.3159 0.2513
## Balanced Accuracy 0.999958 0.9303 0.9557 0.9993
## Class: ultra
## Sensitivity 0.8313
## Specificity 0.9551
## Pos Pred Value 0.8800
## Neg Pred Value 0.9346
## Prevalence 0.2839
## Detection Rate 0.2360
## Detection Prevalence 0.2682
## Balanced Accuracy 0.8932
0.9194
Random forests can automatically generate an estimate of variable importance during training by permuting the values in a given variable and measuring the effect on classification. If scrambling the values has little effect on the model's ability to make predictions, then the variable must not be very important.
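Note that randomForest computes this permutation-based measure only if you request it at training time; a minimal sketch reusing fol and training from above (model_perm is an illustrative name):
model_perm <- randomForest(fol, data=training, importance=TRUE)  # enable permutation importance
importance(model_perm, type=1)  # type = 1: mean decrease in accuracy under permutation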
A random forest can obtain another estimate of variable importance based on the Gini impurity discussed in the lecture. The function importance(model) prints the mean decrease in Gini importance for each variable: the higher the number, the more the Gini impurity decreases when branching on that variable, indicating that the variable is more important.
Call this function and answer Question 10.
importance(model_6)
## MeanDecreaseGini
## fsc_small 2716.6570
## fsc_perp 2131.8252
## fsc_big 198.1133
## pe 8856.7270
## chl_big 4829.5931
## chl_small 8120.4793
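The two largest scores belong to pe and chl_small, which answers Question 10. randomForest also ships a plotting helper for these scores (using model_6 from above):
varImpPlot(model_6)  # dotchart of variable importance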
Use the e1071 library and call model <- svm(fol, data=trainingdata).
Answer Question 11.
library(e1071)
model_svm <- svm(fol, data=training)
Question 11 What was the accuracy of your support vector machine model on the test data? Enter a number between 0 and 1
0.9192
predict_11c=predict(model_svm,newdata=testing,type= "class")
table(predict_11c, testing$pop)
##
## predict_11c crypto nano pico synecho ultra
## crypto 50 5 0 1 0
## nano 0 5615 0 2 420
## pico 0 0 10043 25 1345
## synecho 1 2 62 9044 6
## ultra 0 727 325 1 8497
confusionMatrix(predict_11c, testing$pop)
## Confusion Matrix and Statistics
##
## Reference
## Prediction crypto nano pico synecho ultra
## crypto 50 5 0 1 0
## nano 0 5615 0 2 420
## pico 0 0 10043 25 1345
## synecho 1 2 62 9044 6
## ultra 0 727 325 1 8497
##
## Overall Statistics
##
## Accuracy : 0.9192
## 95% CI : (0.9164, 0.922)
## No Information Rate : 0.2884
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.8911
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: crypto Class: nano Class: pico Class: synecho
## Sensitivity 0.980392 0.8844 0.9629 0.9968
## Specificity 0.999834 0.9858 0.9468 0.9974
## Pos Pred Value 0.892857 0.9301 0.8800 0.9922
## Neg Pred Value 0.999972 0.9756 0.9844 0.9989
## Prevalence 0.001410 0.1755 0.2884 0.2508
## Detection Rate 0.001382 0.1552 0.2777 0.2500
## Detection Prevalence 0.001548 0.1669 0.3155 0.2520
## Balanced Accuracy 0.990113 0.9351 0.9548 0.9971
## Class: ultra
## Sensitivity 0.8275
## Specificity 0.9593
## Pos Pred Value 0.8897
## Neg Pred Value 0.9335
## Prevalence 0.2839
## Detection Rate 0.2349
## Detection Prevalence 0.2640
## Balanced Accuracy 0.8934
Use the table function to generate a confusion matrix for each of your three methods. Generate predictions using the predict function, then call the table function like this:
table(pred = predictions, true = testingdata$pop)
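To produce all three tables in one pass, a compact sketch (assuming model_1, model_6, and model_svm from above; predict methods that take no type argument simply ignore it):
models <- list(tree=model_1, forest=model_6, svm=model_svm)
lapply(models, function(m) table(pred=predict(m, newdata=testing, type="class"), true=testing$pop))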
Answer Question 12.
Question 12 Construct a confusion matrix for each of the three methods using the table function. What appears to be the most common error the models make?
ultra is mistaken for pico
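To confirm this programmatically, zero out the diagonal (the correct predictions) and locate the largest remaining cell; a sketch using the tree's predictions from above:
cm <- table(pred=predict_1c, true=testing$pop)
cm_err <- cm
diag(cm_err) <- 0  # keep only the errors
which(cm_err == max(cm_err), arr.ind=TRUE)  # row = predicted class, col = true class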
As a data scientist, you should never trust the data, especially if you did not collect it yourself. There is no such thing as clean data. You should always be trying to prove your results wrong by finding problems with the data. Richard Feynman calls it “bending over backwards to show how you're maybe wrong.” This is even more critical in data science, because almost by definition you are using someone else's data that was collected for some other purpose rather than the experiment you want to do. So of course it's going to have problems.
The measurements in this dataset are all supposed to be continuous (fsc_small, fsc_perp, fsc_big, pe, chl_small, chl_big), but one is not. Using plots or R code, figure out which field is corrupted.
Answer Question 13
Question 13 The variables in the dataset were assumed to be continuous, but one of them takes on only a few discrete values, suggesting a problem. Which variable exhibits this problem?
library(ggplot2)
p13 <- ggplot(myData, aes(x = time , y = fsc_big ) )
p13 + geom_jitter(aes(color=pop))
fsc_big
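A quick non-graphical check is to count the distinct values each sensor column takes; a sketch (fsc_big turns out to have only a handful):
sapply(myData[, c("fsc_small","fsc_perp","fsc_big","pe","chl_small","chl_big")],
       function(x) length(unique(x)))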
There is a more subtle issue with the data as well. Plot time vs. chl_big, and you will notice a band of the data that looks out of place. This band corresponds to data from a particular file for which the sensor may have been miscalibrated. Remove this data from the dataset by filtering out all data associated with file_id 208, then repeat the experiment for all three methods, making sure to split the dataset into training and test sets after filtering out the bad data.
Answer Question 14
library(ggplot2)
p14 <- ggplot(myData, aes(x = chl_big , y = time ) )
p14 + geom_jitter()
Question 14 After removing data associated with file_id 208, what was the effect on the accuracy of your svm model? Enter a positive or negative number representing the net change in accuracy, where a positive number represents an improvement in accuracy and a negative number represents a decrease in accuracy.
library(e1071)
myDataFiltered <- subset(myData, file_id != 208)
set.seed(3456)
inTrainingSet <- createDataPartition(myDataFiltered$pop, p = 0.5, list = FALSE)
training <- myDataFiltered[inTrainingSet,]
testing <- myDataFiltered[-inTrainingSet,]
summary(training)
## file_id time cell_id d1
## Min. :203.0 Min. : 12.0 Min. : 0 Min. : 1488
## 1st Qu.:204.0 1st Qu.:146.0 1st Qu.: 7451 1st Qu.: 7336
## Median :206.0 Median :303.0 Median :14772 Median :17792
## Mean :205.8 Mean :308.4 Mean :14803 Mean :17123
## 3rd Qu.:207.0 3rd Qu.:435.0 3rd Qu.:22081 3rd Qu.:24624
## Max. :209.0 Max. :643.0 Max. :31307 Max. :50896
## d2 fsc_small fsc_perp fsc_big
## Min. : 32 Min. :10011 Min. : 0 Min. :32384
## 1st Qu.: 9568 1st Qu.:31383 1st Qu.:13483 1st Qu.:32400
## Median :18512 Median :35515 Median :18096 Median :32400
## Mean :17423 Mean :34942 Mean :17658 Mean :32405
## 3rd Qu.:24624 3rd Qu.:39277 3rd Qu.:22339 3rd Qu.:32416
## Max. :54688 Max. :65424 Max. :62173 Max. :32464
## pe chl_small chl_big pop
## Min. : 0 Min. : 3485 Min. : 0 crypto : 44
## 1st Qu.: 1627 1st Qu.:23005 1st Qu.: 2992 nano :4784
## Median : 2379 Median :30768 Median : 7872 pico :9964
## Mean : 5201 Mean :30406 Mean : 8426 synecho:7124
## 3rd Qu.: 5216 3rd Qu.:38474 3rd Qu.:12944 ultra :8259
## Max. :58675 Max. :64805 Max. :57184
model_svm <- svm(fol, data=training)
Recompute the SVM accuracy on the filtered test data.
# type="class" keeps the predict call consistent across the three model types
predict_11c=predict(model_svm,newdata=testing,type="class")
table(predict_11c, testing$pop)
##
## predict_11c crypto nano pico synecho ultra
## crypto 40 2 0 0 0
## nano 0 4570 0 1 205
## pico 0 0 9723 23 187
## synecho 3 3 51 7097 2
## ultra 0 209 190 3 7864
confusionMatrix(predict_11c, testing$pop)
## Confusion Matrix and Statistics
##
## Reference
## Prediction crypto nano pico synecho ultra
## crypto 40 2 0 0 0
## nano 0 4570 0 1 205
## pico 0 0 9723 23 187
## synecho 3 3 51 7097 2
## ultra 0 209 190 3 7864
##
## Overall Statistics
##
## Accuracy : 0.9709
## 95% CI : (0.9689, 0.9727)
## No Information Rate : 0.3302
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9604
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: crypto Class: nano Class: pico Class: synecho
## Sensitivity 0.930233 0.9553 0.9758 0.9962
## Specificity 0.999934 0.9919 0.9896 0.9974
## Pos Pred Value 0.952381 0.9569 0.9789 0.9918
## Neg Pred Value 0.999900 0.9916 0.9881 0.9988
## Prevalence 0.001425 0.1586 0.3302 0.2361
## Detection Rate 0.001326 0.1515 0.3222 0.2352
## Detection Prevalence 0.001392 0.1583 0.3292 0.2372
## Balanced Accuracy 0.965083 0.9736 0.9827 0.9968
## Class: ultra
## Sensitivity 0.9523
## Specificity 0.9817
## Pos Pred Value 0.9514
## Neg Pred Value 0.9820
## Prevalence 0.2737
## Detection Rate 0.2606
## Detection Prevalence 0.2740
## Balanced Accuracy 0.9670
# net change: filtered accuracy minus original accuracy (both from the confusionMatrix output above)
0.9709 - 0.9192
## [1] 0.0517
0.0517