Even when the air seems completely clear, the atmosphere is full of particles. The data simulate the registration of high-energy gamma particles in a ground-based atmospheric Cherenkov gamma telescope using the imaging technique. Here we want to classify gamma particles using Naive Bayes, Decision Tree, and Random Forest models, using the MAGIC Gamma Telescope dataset downloaded from Kaggle.
Some relevant columns in the data:
- fLength : major axis of ellipse [mm]
- fWidth : minor axis of ellipse [mm]
- fSize : 10-log of sum of content of all pixels [in #phot]
- fConc : ratio of sum of two highest pixels over fSize [ratio]
- fConc1 : ratio of highest pixel over fSize [ratio]
- fAsym : distance from highest pixel to center, projected onto major axis [mm]
- fM3Long : 3rd root of third moment along major axis [mm]
- fM3Trans : 3rd root of third moment along minor axis [mm]
- fAlpha : angle of major axis with vector to origin [deg]
- fDist : distance from origin to center of ellipse [mm]
- class : gamma (signal), hadron (background)
library(caret)
library(dplyr)
library(e1071)
library(partykit)
library(ROCR)
library(rsample)
tele <- read.csv("data/telescope_data.csv")
head(tele)
## X fLength fWidth fSize fConc fConc1 fAsym fM3Long fM3Trans fAlpha
## 1 0 28.7967 16.0021 2.6449 0.3918 0.1982 27.7004 22.0110 -8.2027 40.0920
## 2 1 31.6036 11.7235 2.5185 0.5303 0.3773 26.2722 23.8238 -9.9574 6.3609
## 3 2 162.0520 136.0310 4.0612 0.0374 0.0187 116.7410 -64.8580 -45.2160 76.9600
## 4 3 23.8172 9.5728 2.3385 0.6147 0.3922 27.2107 -6.4633 -7.1513 10.4490
## 5 4 75.1362 30.9205 3.1611 0.3168 0.1832 -5.5277 28.5525 21.8393 4.6480
## 6 5 51.6240 21.1502 2.9085 0.2420 0.1340 50.8761 43.1887 9.8145 3.6130
## fDist class
## 1 81.8828 g
## 2 205.2610 g
## 3 256.7880 g
## 4 116.7370 g
## 5 356.4620 g
## 6 238.0980 g
tail(tele)
## X fLength fWidth fSize fConc fConc1 fAsym fM3Long fM3Trans
## 19015 19014 43.2980 17.3545 2.8307 0.2877 0.1646 -60.1842 -33.8513 -3.6545
## 19016 19015 21.3846 10.9170 2.6161 0.5857 0.3934 15.2618 11.5245 2.8766
## 19017 19016 28.9452 6.7020 2.2672 0.5351 0.2784 37.0816 13.1853 -2.9632
## 19018 19017 75.4455 47.5305 3.4483 0.1417 0.0549 -9.3561 41.0562 -9.4662
## 19019 19018 120.5135 76.9018 3.9939 0.0944 0.0683 5.8043 -93.5224 -63.8389
## 19020 19019 187.1814 53.0014 3.2093 0.2876 0.1539 -167.3125 -168.4558 31.4755
## fAlpha fDist class
## 19015 78.4099 224.8299 h
## 19016 2.4229 106.8258 h
## 19017 86.7975 247.4560 h
## 19018 30.2987 256.5166 h
## 19019 84.6874 408.3166 h
## 19020 52.7310 272.3174 h
First, we check whether there are any missing values in the data.
anyNA(tele)
## [1] FALSE
colSums(is.na(tele))
## X fLength fWidth fSize fConc fConc1 fAsym fM3Long
## 0 0 0 0 0 0 0 0
## fM3Trans fAlpha fDist class
## 0 0 0 0
There are no missing values in this data.
Next, we check the data type of each column.
glimpse(tele)
## Rows: 19,020
## Columns: 12
## $ X <int> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17,~
## $ fLength <dbl> 28.7967, 31.6036, 162.0520, 23.8172, 75.1362, 51.6240, 48.246~
## $ fWidth <dbl> 16.0021, 11.7235, 136.0310, 9.5728, 30.9205, 21.1502, 17.3565~
## $ fSize <dbl> 2.6449, 2.5185, 4.0612, 2.3385, 3.1611, 2.9085, 3.0332, 2.552~
## $ fConc <dbl> 0.3918, 0.5303, 0.0374, 0.6147, 0.3168, 0.2420, 0.2529, 0.423~
## $ fConc1 <dbl> 0.1982, 0.3773, 0.0187, 0.3922, 0.1832, 0.1340, 0.1515, 0.217~
## $ fAsym <dbl> 27.7004, 26.2722, 116.7410, 27.2107, -5.5277, 50.8761, 8.5730~
## $ fM3Long <dbl> 22.0110, 23.8238, -64.8580, -6.4633, 28.5525, 43.1887, 38.095~
## $ fM3Trans <dbl> -8.2027, -9.9574, -45.2160, -7.1513, 21.8393, 9.8145, 10.5868~
## $ fAlpha <dbl> 40.0920, 6.3609, 76.9600, 10.4490, 4.6480, 3.6130, 4.7920, 0.~
## $ fDist <dbl> 81.8828, 205.2610, 256.7880, 116.7370, 356.4620, 238.0980, 21~
## $ class <chr> "g", "g", "g", "g", "g", "g", "g", "g", "g", "g", "g", "g", "~
tele <- tele %>%
select(-X) %>%
mutate(class = as.factor(class))
Before we use the data, we need to check the class proportion between gamma and hadron observations.
tele$class %>%
table() %>%
prop.table()
## .
## g h
## 0.6483701 0.3516299
The class proportions are reasonably balanced (roughly 65 % gamma and 35 % hadron), so we can use the data as it is.
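If the classes had been heavily imbalanced instead, one common remedy would be to rebalance the data before modelling. A minimal sketch using caret's downSample() (purely illustrative and not needed here; in practice it would be applied to the training set only, after splitting):
# Illustrative only: randomly drop rows of the majority class so that both
# classes end up with the same number of observations.
balanced <- downSample(x = select(tele, -class),
                       y = tele$class,
                       yname = "class")
prop.table(table(balanced$class))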
Here we split the data into a training set and a test set. The training set is used to build the models and the test set is used to evaluate them.
set.seed(100)
splitted <- initial_split(data = tele, prop = 0.8, strata = "class")
tele_train <- training(splitted)
tele_test <- testing(splitted)
We will check the class proportions again in both the training and test sets.
tele_train$class %>%
table() %>%
prop.table()
## .
## g h
## 0.6483538 0.3516462
tele_test$class %>%
table() %>%
prop.table()
## .
## g h
## 0.6484354 0.3515646
Since the class proportions in the training and test data are comparable, we can proceed to the next step.
The first model we will build is Naive Bayes.
model_naive <- naiveBayes(class ~ .,
data = tele_train,
laplace = 1)
After building the model, we use it to predict the test data to assess its accuracy, and we save the predictions as a new column in the test data.
tele_test$predicted <- predict(object = model_naive,
newdata = tele_test,
type = "class")
tele_test %>%
select(class, predicted) %>%
head(10)
## class predicted
## 1 g g
## 6 g g
## 7 g g
## 21 g h
## 23 g g
## 25 g g
## 26 g h
## 28 g g
## 30 g g
## 44 g g
We will evaluate the model using a confusion matrix.
confusionMatrix(data = tele_test$predicted,
reference = tele_test$class,
positive = "g")## Confusion Matrix and Statistics
##
## Reference
## Prediction g h
## g 2260 819
## h 206 518
##
## Accuracy : 0.7305
## 95% CI : (0.7161, 0.7445)
## No Information Rate : 0.6484
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.3395
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.9165
## Specificity : 0.3874
## Pos Pred Value : 0.7340
## Neg Pred Value : 0.7155
## Prevalence : 0.6484
## Detection Rate : 0.5943
## Detection Prevalence : 0.8096
## Balanced Accuracy : 0.6519
##
## 'Positive' Class : g
##
The confusion matrix above shows that our Naive Bayes model reaches an accuracy of 73.05 % on the test data, meaning that 73.05 % of the test observations are classified correctly.
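For reference, the same accuracy figure can be recomputed by hand from the counts in the confusion matrix above (correct predictions divided by all predictions):
# Counts taken from the Naive Bayes confusion matrix above.
(2260 + 518) / (2260 + 819 + 206 + 518)   # = 0.7305, the reported accuracy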
Next, we will build a model using a Decision Tree.
model_tree <- ctree(class ~ .,
data = tele_train)
plot(model_tree, type = "simple")
After building the model, we use it to predict the test data to assess its accuracy.
pred_tree <- predict(model_tree,
newdata = tele_test,
type = "response")We will evaluate the model using confusion matrix.
confusionMatrix(data = pred_tree,
reference = tele_test$class,
positive = "g")## Confusion Matrix and Statistics
##
## Reference
## Prediction g h
## g 2306 403
## h 160 934
##
## Accuracy : 0.852
## 95% CI : (0.8403, 0.8631)
## No Information Rate : 0.6484
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.6612
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.9351
## Specificity : 0.6986
## Pos Pred Value : 0.8512
## Neg Pred Value : 0.8537
## Prevalence : 0.6484
## Detection Rate : 0.6064
## Detection Prevalence : 0.7123
## Balanced Accuracy : 0.8168
##
## 'Positive' Class : g
##
The confusion matrix above shows that our Decision Tree model reaches an accuracy of 85.20 % on the test data, meaning that 85.20 % of the test observations are classified correctly.
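Sensitivity and specificity follow from the same matrix in a similar way (counts taken from the decision tree confusion matrix above):
2306 / (2306 + 160)   # sensitivity: true gammas predicted as gamma, ~ 0.9351
934 / (934 + 403)     # specificity: true hadrons predicted as hadron, ~ 0.6986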
Lastly, we will make a model using random forest.
set.seed(100)
ctrl <- trainControl(method = "repeatedcv", number = 4, repeats = 4)
forest <- train(class ~ .,
data = tele_train,
method = "rf",
trControl = ctrl)
forest
## Random Forest
##
## 15217 samples
## 10 predictor
## 2 classes: 'g', 'h'
##
## No pre-processing
## Resampling: Cross-Validated (4 fold, repeated 4 times)
## Summary of sample sizes: 11412, 11413, 11412, 11414, 11412, 11412, ...
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 2 0.8770455 0.7224043
## 6 0.8767169 0.7221868
## 10 0.8756983 0.7196178
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 2.
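By default caret tried only a few mtry values. If we wanted to control the candidates ourselves, an explicit grid could be passed via tuneGrid; a minimal sketch (not run here, and forest_custom is an illustrative name):
# Sketch: tune over a user-chosen set of mtry values, reusing the same
# repeated cross-validation control object defined above.
forest_custom <- train(class ~ .,
                       data = tele_train,
                       method = "rf",
                       trControl = ctrl,
                       tuneGrid = expand.grid(mtry = c(2, 3, 4)))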
plot(forest)
varImp(forest)
## rf variable importance
##
## Overall
## fAlpha 100.0000
## fLength 47.3335
## fWidth 37.5841
## fSize 27.1045
## fM3Long 19.0356
## fDist 5.6099
## fConc 3.9262
## fAsym 1.1624
## fConc1 0.9358
## fM3Trans 0.0000
From the model summary, the optimal number of variables considered at each split (mtry) is 2, and fAlpha is the most influential predictor in the model. Next, we look at the out-of-bag (OOB) error rate.
plot(forest$finalModel)
legend("topright",
colnames(forest$finalModel$err.rate),
col=1:6,cex=0.8,fill=1:6)
forest$finalModel
##
## Call:
## randomForest(x = x, y = y, mtry = param$mtry)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 2
##
## OOB estimate of error rate: 12.2%
## Confusion matrix:
## g h class.error
## g 9266 600 0.06081492
## h 1256 4095 0.23472248
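As a quick sanity check, the reported OOB error rate follows directly from the out-of-bag confusion matrix above (misclassified observations divided by all training observations):
# Counts taken from the OOB confusion matrix above.
(600 + 1256) / (9266 + 600 + 1256 + 4095)   # ~ 0.122, i.e. 12.2 %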
The OOB error rate is 12.20 %. Next, we will use the model to predict the test data.
pred_forest <- predict(forest,
newdata = tele_test,
type = "raw")
head(pred_forest)
## [1] g g g h g g
## Levels: g h
We will evaluate the model using a confusion matrix.
confusionMatrix(data = pred_forest,
reference = tele_test$class,
positive = "g")## Confusion Matrix and Statistics
##
## Reference
## Prediction g h
## g 2348 312
## h 118 1025
##
## Accuracy : 0.8869
## 95% CI : (0.8764, 0.8968)
## No Information Rate : 0.6484
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.7435
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.9521
## Specificity : 0.7666
## Pos Pred Value : 0.8827
## Neg Pred Value : 0.8968
## Prevalence : 0.6484
## Detection Rate : 0.6174
## Detection Prevalence : 0.6994
## Balanced Accuracy : 0.8594
##
## 'Positive' Class : g
##
The confusion matrix above shows that our Random Forest model reaches an accuracy of 88.69 % on the test data, meaning that 88.69 % of the test observations are classified correctly. To compare the three models, we summarise their test-set metrics side by side.
data.frame("Model" = c("Naive Bayes", "Decision Tree", "Random Forest"),
"Accuracy" = c(73.05, 85.20, 88.69),
"Sensitivity" = c(91.65, 93.51, 95.21),
"Specificity" = c(38.74, 69.86, 76.66),
"Pos Pred Value" = c(73.40, 85.12, 88.27))## Model Accuracy Sensitivity Specificity Pos.Pred.Value
## 1 Naive Bayes 73.05 91.65 38.74 73.40
## 2 Decision Tree 85.20 93.51 69.86 85.12
## 3 Random Forest 88.69 95.21 76.66 88.27
From all the models and tests above, we can conclude that the Random Forest model is the best option. The next time we need to classify gamma particles from data recorded by a Cherenkov gamma telescope, we can use the Random Forest model to do so.
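As a possible next step, since the ROCR package was loaded at the start, the models could also be compared with ROC curves and AUC instead of a single classification threshold. A minimal sketch for the random forest, assuming class probabilities from predict(..., type = "prob") (not run here):
# Sketch: ROC curve and AUC for the random forest on the test set.
prob_forest <- predict(forest, newdata = tele_test, type = "prob")
pred_rocr <- prediction(prob_forest[, "g"],
                        labels = ifelse(tele_test$class == "g", 1, 0))
perf_roc <- performance(pred_rocr, "tpr", "fpr")
plot(perf_roc)                               # ROC curve
performance(pred_rocr, "auc")@y.values[[1]]  # area under the curve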