1 Explanation

Even when the air seems completely clear, our atmosphere contains many particles. The data were generated to simulate the registration of high-energy gamma particles in a ground-based atmospheric Cherenkov gamma telescope using the imaging technique. Here we want to classify gamma particles using Naive Bayes, Decision Tree, and Random Forest models, using the MAGIC Gamma Telescope Dataset downloaded from Kaggle.

Some relevant columns in the data:
- fLength : major axis of ellipse [mm]
- fWidth : minor axis of ellipse [mm]
- fSize : 10-log of sum of content of all pixels [in #phot]
- fConc : ratio of sum of two highest pixels over fSize [ratio]
- fConc1 : ratio of highest pixel over fSize [ratio]
- fAsym : distance from highest pixel to center, projected onto major axis [mm]
- fM3Long : 3rd root of third moment along major axis [mm]
- fM3Trans : 3rd root of third moment along minor axis [mm]
- fAlpha : angle of major axis with vector to origin [deg]
- fDist : distance from origin to center of ellipse [mm]
- class : gamma (signal), hadron (background)

2 Data Preparation

2.1 Attaching Packages

library(caret)
## Warning: package 'caret' was built under R version 4.0.5
## Loading required package: lattice
## Loading required package: ggplot2
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(e1071)
## Warning: package 'e1071' was built under R version 4.0.5
library(partykit)
## Warning: package 'partykit' was built under R version 4.0.5
## Loading required package: grid
## Loading required package: libcoin
## Warning: package 'libcoin' was built under R version 4.0.5
## Loading required package: mvtnorm
library(ROCR)
## Warning: package 'ROCR' was built under R version 4.0.5
library(rsample)
## Warning: package 'rsample' was built under R version 4.0.5
## 
## Attaching package: 'rsample'
## The following object is masked from 'package:e1071':
## 
##     permutations

2.2 Input Data

tele <- read.csv("data/telescope_data.csv")

2.3 Data Inspection

head(tele)
##   X  fLength   fWidth  fSize  fConc fConc1    fAsym  fM3Long fM3Trans  fAlpha
## 1 0  28.7967  16.0021 2.6449 0.3918 0.1982  27.7004  22.0110  -8.2027 40.0920
## 2 1  31.6036  11.7235 2.5185 0.5303 0.3773  26.2722  23.8238  -9.9574  6.3609
## 3 2 162.0520 136.0310 4.0612 0.0374 0.0187 116.7410 -64.8580 -45.2160 76.9600
## 4 3  23.8172   9.5728 2.3385 0.6147 0.3922  27.2107  -6.4633  -7.1513 10.4490
## 5 4  75.1362  30.9205 3.1611 0.3168 0.1832  -5.5277  28.5525  21.8393  4.6480
## 6 5  51.6240  21.1502 2.9085 0.2420 0.1340  50.8761  43.1887   9.8145  3.6130
##      fDist class
## 1  81.8828     g
## 2 205.2610     g
## 3 256.7880     g
## 4 116.7370     g
## 5 356.4620     g
## 6 238.0980     g
tail(tele)
##           X  fLength  fWidth  fSize  fConc fConc1     fAsym   fM3Long fM3Trans
## 19015 19014  43.2980 17.3545 2.8307 0.2877 0.1646  -60.1842  -33.8513  -3.6545
## 19016 19015  21.3846 10.9170 2.6161 0.5857 0.3934   15.2618   11.5245   2.8766
## 19017 19016  28.9452  6.7020 2.2672 0.5351 0.2784   37.0816   13.1853  -2.9632
## 19018 19017  75.4455 47.5305 3.4483 0.1417 0.0549   -9.3561   41.0562  -9.4662
## 19019 19018 120.5135 76.9018 3.9939 0.0944 0.0683    5.8043  -93.5224 -63.8389
## 19020 19019 187.1814 53.0014 3.2093 0.2876 0.1539 -167.3125 -168.4558  31.4755
##        fAlpha    fDist class
## 19015 78.4099 224.8299     h
## 19016  2.4229 106.8258     h
## 19017 86.7975 247.4560     h
## 19018 30.2987 256.5166     h
## 19019 84.6874 408.3166     h
## 19020 52.7310 272.3174     h

2.4 Data Cleansing

Check whether there are any missing values in the data.

anyNA(tele)
## [1] FALSE
colSums(is.na(tele))
##        X  fLength   fWidth    fSize    fConc   fConc1    fAsym  fM3Long 
##        0        0        0        0        0        0        0        0 
## fM3Trans   fAlpha    fDist    class 
##        0        0        0        0

There are no missing values in this data.

Check the data type of each column.

glimpse(tele)
## Rows: 19,020
## Columns: 12
## $ X        <int> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17,~
## $ fLength  <dbl> 28.7967, 31.6036, 162.0520, 23.8172, 75.1362, 51.6240, 48.246~
## $ fWidth   <dbl> 16.0021, 11.7235, 136.0310, 9.5728, 30.9205, 21.1502, 17.3565~
## $ fSize    <dbl> 2.6449, 2.5185, 4.0612, 2.3385, 3.1611, 2.9085, 3.0332, 2.552~
## $ fConc    <dbl> 0.3918, 0.5303, 0.0374, 0.6147, 0.3168, 0.2420, 0.2529, 0.423~
## $ fConc1   <dbl> 0.1982, 0.3773, 0.0187, 0.3922, 0.1832, 0.1340, 0.1515, 0.217~
## $ fAsym    <dbl> 27.7004, 26.2722, 116.7410, 27.2107, -5.5277, 50.8761, 8.5730~
## $ fM3Long  <dbl> 22.0110, 23.8238, -64.8580, -6.4633, 28.5525, 43.1887, 38.095~
## $ fM3Trans <dbl> -8.2027, -9.9574, -45.2160, -7.1513, 21.8393, 9.8145, 10.5868~
## $ fAlpha   <dbl> 40.0920, 6.3609, 76.9600, 10.4490, 4.6480, 3.6130, 4.7920, 0.~
## $ fDist    <dbl> 81.8828, 205.2610, 256.7880, 116.7370, 356.4620, 238.0980, 21~
## $ class    <chr> "g", "g", "g", "g", "g", "g", "g", "g", "g", "g", "g", "g", "~
The X column is only a row index and the target class should be a factor, so we drop X and convert class.

tele <- tele %>%
  select(-X) %>%                     # drop the row-index column
  mutate(class = as.factor(class))   # convert the target to a factor

2.5 Exploratory Data Analysis

Before we use the data, we need to check the class proportions between gamma and hadron events.

tele$class %>% 
   table() %>% 
   prop.table()
## .
##         g         h 
## 0.6483701 0.3516299

The class proportions (roughly 65% gamma, 35% hadron) are not severely imbalanced, so we can use the data as is.

2.6 Cross Validation

Here we split the data into a training set and a test set. The training data is used to build the models, and the test data is used to evaluate them.

set.seed(100)
# strata = "class" makes the 80/20 split stratified, preserving class proportions
splitted <- initial_split(data = tele, prop = 0.8, strata = "class")
tele_train <- training(splitted)
tele_test <- testing(splitted)

We will check the class proportions again to confirm the split is stratified correctly.

tele_train$class %>% 
   table() %>% 
   prop.table()
## .
##         g         h 
## 0.6483538 0.3516462
tele_test$class %>% 
   table() %>% 
   prop.table()
## .
##         g         h 
## 0.6484354 0.3515646

Since the class proportions in the training and test data match the original data, we can proceed to the next step.

3 Naive Bayes

The first model we build uses Naive Bayes.

# laplace = 1 adds Laplace smoothing (it only affects categorical predictors)
model_naive <- naiveBayes(class ~ .,
                          data = tele_train,
                          laplace = 1)

After building the model, we use it to predict the test data and store the predicted labels in a new column of the test set.

tele_test$predicted <- predict(object = model_naive,
                               newdata = tele_test,
                               type = "class")
tele_test %>% 
  select(class, predicted) %>% 
  head(10)
##    class predicted
## 1      g         g
## 6      g         g
## 7      g         g
## 21     g         h
## 23     g         g
## 25     g         g
## 26     g         h
## 28     g         g
## 30     g         g
## 44     g         g

We will evaluate the model using a confusion matrix.

confusionMatrix(data = tele_test$predicted,
                reference = tele_test$class,
                positive = "g")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    g    h
##          g 2260  819
##          h  206  518
##                                           
##                Accuracy : 0.7305          
##                  95% CI : (0.7161, 0.7445)
##     No Information Rate : 0.6484          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.3395          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.9165          
##             Specificity : 0.3874          
##          Pos Pred Value : 0.7340          
##          Neg Pred Value : 0.7155          
##              Prevalence : 0.6484          
##          Detection Rate : 0.5943          
##    Detection Prevalence : 0.8096          
##       Balanced Accuracy : 0.6519          
##                                           
##        'Positive' Class : g               
## 

The confusion matrix above shows that our Naive Bayes model reaches an accuracy of 73.05% on the test data, meaning 73.05% of the test observations are classified correctly. Note the low specificity (38.74%): a large share of hadron events is misclassified as gamma.
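
The ROCR package is attached but not used above. As a complementary, threshold-independent check, we can sketch the ROC curve and AUC of the Naive Bayes model (this block is an addition, not part of the original analysis; it treats "g" as the positive class):

# Posterior probability of the positive class "g"
prob_naive <- predict(model_naive, newdata = tele_test, type = "raw")[, "g"]

# ROCR expects the negative label first in label.ordering
pred_rocr <- prediction(predictions = prob_naive,
                        labels = tele_test$class,
                        label.ordering = c("h", "g"))
perf_roc <- performance(pred_rocr, "tpr", "fpr")
plot(perf_roc, main = "ROC Curve - Naive Bayes")
abline(0, 1, lty = 2)

# Area under the ROC curve
performance(pred_rocr, "auc")@y.values[[1]]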

4 Decision Tree

Next, we will build a model using a Decision Tree (a conditional inference tree from the partykit package).

model_tree <- ctree(class ~ .,
                    data = tele_train)

plot(model_tree, type = "simple")
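
The full tree grown on this data is large, so the plot above can be hard to read. As a sketch (the maxdepth value below is an arbitrary assumption, chosen only for visualization), we could grow a shallower tree just for plotting:

# Depth-limited tree, only to get a readable plot
model_tree_small <- ctree(class ~ .,
                          data = tele_train,
                          control = ctree_control(maxdepth = 3))
plot(model_tree_small, type = "simple")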

After building the model, we use it to predict the test data and check its accuracy.

pred_tree <- predict(model_tree, 
                     newdata = tele_test, 
                     type = "response")

We will evaluate the model using a confusion matrix.

confusionMatrix(data = pred_tree,
                reference = tele_test$class,
                positive = "g")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    g    h
##          g 2306  403
##          h  160  934
##                                           
##                Accuracy : 0.852           
##                  95% CI : (0.8403, 0.8631)
##     No Information Rate : 0.6484          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.6612          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.9351          
##             Specificity : 0.6986          
##          Pos Pred Value : 0.8512          
##          Neg Pred Value : 0.8537          
##              Prevalence : 0.6484          
##          Detection Rate : 0.6064          
##    Detection Prevalence : 0.7123          
##       Balanced Accuracy : 0.8168          
##                                           
##        'Positive' Class : g               
## 

The confusion matrix above shows that our decision tree model reaches an accuracy of 85.20% on the test data, meaning 85.20% of the test observations are classified correctly.

5 Random Forest

Lastly, we will build a model using Random Forest.

set.seed(100)
# 4-fold cross-validation, repeated 4 times, to tune mtry
ctrl <- trainControl(method = "repeatedcv", number = 4, repeats = 4)
forest <- train(class ~ ., 
                data = tele_train, 
                method = "rf", 
                trControl = ctrl)

forest
## Random Forest 
## 
## 15217 samples
##    10 predictor
##     2 classes: 'g', 'h' 
## 
## No pre-processing
## Resampling: Cross-Validated (4 fold, repeated 4 times) 
## Summary of sample sizes: 11412, 11413, 11412, 11414, 11412, 11412, ... 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa    
##    2    0.8770455  0.7224043
##    6    0.8767169  0.7221868
##   10    0.8756983  0.7196178
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 2.
plot(forest)
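
By default, caret evaluated mtry = 2, 6, and 10. If we wanted to search our own grid of mtry values, we could pass one explicitly via tuneGrid; a minimal sketch (the candidate values below are an assumption, not part of the original analysis):

# Hypothetical mtry candidates around the default winner (mtry = 2)
tune_grid <- expand.grid(mtry = c(2, 3, 4))
forest_tuned <- train(class ~ .,
                      data = tele_train,
                      method = "rf",
                      trControl = ctrl,
                      tuneGrid = tune_grid)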

varImp(forest)
## rf variable importance
## 
##           Overall
## fAlpha   100.0000
## fLength   47.3335
## fWidth    37.5841
## fSize     27.1045
## fM3Long   19.0356
## fDist      5.6099
## fConc      3.9262
## fAsym      1.1624
## fConc1     0.9358
## fM3Trans   0.0000

From the model summary, the optimal number of variables considered for splitting at each tree node (mtry) is 2, and fAlpha is by far the most influential variable in the model. Next, we will look at the out-of-bag (OOB) error rate.

plot(forest$finalModel)
legend("topright",
colnames(forest$finalModel$err.rate),
       col=1:6,cex=0.8,fill=1:6)

forest$finalModel
## 
## Call:
##  randomForest(x = x, y = y, mtry = param$mtry) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 2
## 
##         OOB estimate of  error rate: 12.2%
## Confusion matrix:
##      g    h class.error
## g 9266  600  0.06081492
## h 1256 4095  0.23472248

The OOB error rate is 12.2%. Next, we will use the model to predict the test data.

pred_forest <- predict(forest,
                       newdata = tele_test,
                       type = "raw")

head(pred_forest)
## [1] g g g h g g
## Levels: g h

We will evaluate the model using a confusion matrix.

confusionMatrix(data = pred_forest,
                reference = tele_test$class,
                positive = "g")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    g    h
##          g 2348  312
##          h  118 1025
##                                           
##                Accuracy : 0.8869          
##                  95% CI : (0.8764, 0.8968)
##     No Information Rate : 0.6484          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.7435          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.9521          
##             Specificity : 0.7666          
##          Pos Pred Value : 0.8827          
##          Neg Pred Value : 0.8968          
##              Prevalence : 0.6484          
##          Detection Rate : 0.6174          
##    Detection Prevalence : 0.6994          
##       Balanced Accuracy : 0.8594          
##                                           
##        'Positive' Class : g               
## 

The confusion matrix above shows that our Random Forest model reaches an accuracy of 88.69% on the test data, meaning 88.69% of the test observations are classified correctly.

6 Conclusion

data.frame("Model" = c("Naive Bayes", "Decision Tree", "Random Forest"),
           "Accuracy" = c(73.05, 85.20, 88.69),
           "Sensitivity" = c(91.65, 93.51, 95.21),
           "Specificity" = c(38.74, 69.86, 76.66),
           "Pos Pred Value" = c(73.40, 85.12, 88.27))
##           Model Accuracy Sensitivity Specificity Pos.Pred.Value
## 1   Naive Bayes    73.05       91.65       38.74          73.40
## 2 Decision Tree    85.20       93.51       69.86          85.12
## 3 Random Forest    88.69       95.21       76.66          88.27

From all the models and tests above, we can conclude that the Random Forest model is the best option: it has the highest accuracy, sensitivity, and specificity of the three. The next time we need to classify gamma particles from Cherenkov gamma telescope data, we can use the Random Forest model.
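
To reuse the chosen model later without retraining, we can save it to disk with base R; a minimal sketch (the file name is an assumption):

saveRDS(forest, "model_forest_magic.rds")        # persist the trained caret model
forest_reloaded <- readRDS("model_forest_magic.rds")
predict(forest_reloaded, newdata = head(tele_test))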