In this project, I’m going to look at the Stellar Classification dataset and build a machine learning model that can predict the class of each object in it. The dataset is available on Kaggle at the following URL:
https://www.kaggle.com/datasets/fedesoriano/stellar-classification-dataset-sdss17
It consists of 100,000 observations of deep-sky objects. I’ll be using the caret package to build the classification models. Let’s load the required libraries, then load the dataset and take a look at its columns.
## Load the required libraries
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
library(caret)
## Loading required package: lattice
## Load the stellar_classification.csv
df <- read.csv("stellar_classification.csv")
## Take a look at the columns of the dataset
names(df)
## [1] "obj_ID" "alpha" "delta" "u" "g"
## [6] "r" "i" "z" "run_ID" "rerun_ID"
## [11] "cam_col" "field_ID" "spec_obj_ID" "class" "redshift"
## [16] "plate" "MJD" "fiber_ID"
We’re really only concerned with seven of the columns in the dataset: the five photometric filter magnitudes (u, g, r, i and z), the redshift, and the class. A full description of every column can be found at the Kaggle link above.
The class is what we are trying to predict. I’m going to use the dplyr package to select only the columns that we need, then make sure that the class column is a factor so that it can be used for classification.
stellarClean <- select(df, u, g, r, i, z, redshift, class)
stellarClean$class <- factor(stellarClean$class)
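Before splitting the data, a quick sanity check on the cleaned data frame can be useful. This step isn’t strictly necessary; it’s just a sketch to confirm the class balance and that there are no missing values.
## Check the class balance and look for missing values
table(stellarClean$class)
colSums(is.na(stellarClean))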
In this section, we’re going to set the seed value for reproducibility and split the data into a training set and a testing set.
## Set the seed value
set.seed(3845)
## Split the stellarClean set into a training and testing set.
inTrain <- createDataPartition(stellarClean$class, p = 0.75, list = FALSE)
training <- stellarClean[inTrain,]
testing <- stellarClean[-inTrain,]
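If you want to confirm the split behaved as expected, something like the following (an optional check, not required for the modelling) shows the number of rows in each set and that createDataPartition preserved the class proportions:
## Check the size of each set and the class proportions
nrow(training)
nrow(testing)
prop.table(table(training$class))
prop.table(table(testing$class))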
We’re going to train a few different models and compare their accuracy. The first model we’ll create is a recursive partitioning (rpart) model, using all of the available columns in the training set as predictors. Once the model is trained, we’ll check its accuracy by looking at the confusion matrix on the testing set.
modelrpart <- train(class ~ ., data = training, method = "rpart")
confusionMatrix(testing$class, predict(modelrpart, newdata = testing))
## Confusion Matrix and Statistics
##
## Reference
## Prediction GALAXY QSO STAR
## GALAXY 10855 184 107
## QSO 690 2863 2
## STAR 0 0 4049
##
## Overall Statistics
##
## Accuracy : 0.9476
## 95% CI : (0.9443, 0.9507)
## No Information Rate : 0.6157
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9056
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Statistics by Class:
##
## Class: GALAXY Class: QSO Class: STAR
## Sensitivity 0.9402 0.9396 0.9738
## Specificity 0.9596 0.9559 1.0000
## Pos Pred Value 0.9739 0.8053 1.0000
## Neg Pred Value 0.9093 0.9879 0.9926
## Prevalence 0.6157 0.1625 0.2218
## Detection Rate 0.5789 0.1527 0.2159
## Detection Prevalence 0.5945 0.1896 0.2159
## Balanced Accuracy 0.9499 0.9478 0.9869
This model has an accuracy of 94.76% with a Kappa value of 0.9056. That’s fairly accurate, but other models may perform better.
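If you’d like to see what the fitted tree actually looks like, caret stores the underlying rpart object in the finalModel slot. Here is a minimal sketch, assuming the rpart.plot package is installed:
## Plot the final decision tree from the caret model object
library(rpart.plot)
rpart.plot(modelrpart$finalModel)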
Next, we will try a linear discriminant analysis model and see if we can improve the accuracy.
modellda <- train(class ~ ., data = training, method = "lda")
confusionMatrix(testing$class, predict(modellda, newdata = testing))
## Confusion Matrix and Statistics
##
## Reference
## Prediction GALAXY QSO STAR
## GALAXY 10570 137 439
## QSO 566 2942 47
## STAR 1704 2 2343
##
## Overall Statistics
##
## Accuracy : 0.8456
## 95% CI : (0.8403, 0.8507)
## No Information Rate : 0.6848
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.7082
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Statistics by Class:
##
## Class: GALAXY Class: QSO Class: STAR
## Sensitivity 0.8232 0.9549 0.8282
## Specificity 0.9025 0.9609 0.8928
## Pos Pred Value 0.9483 0.8276 0.5787
## Neg Pred Value 0.7015 0.9909 0.9669
## Prevalence 0.6848 0.1643 0.1509
## Detection Rate 0.5637 0.1569 0.1250
## Detection Prevalence 0.5945 0.1896 0.2159
## Balanced Accuracy 0.8629 0.9579 0.8605
The LDA model doesn’t perform as well as the recursive partitioning model does. It has an accuracy of 84.56% and a Kappa value of 0.7082. We’ll try a naive Bayes model and see if we get better results.
modelnb <- train(class ~ ., data = training, method = "naive_bayes")
confusionMatrix(testing$class, predict(modelnb, newdata = testing))
## Confusion Matrix and Statistics
##
## Reference
## Prediction GALAXY QSO STAR
## GALAXY 10239 898 9
## QSO 414 3128 13
## STAR 2928 612 509
##
## Overall Statistics
##
## Accuracy : 0.7401
## 95% CI : (0.7337, 0.7463)
## No Information Rate : 0.7243
## P-Value [Acc > NIR] : 6.459e-07
##
## Kappa : 0.4966
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Statistics by Class:
##
## Class: GALAXY Class: QSO Class: STAR
## Sensitivity 0.7539 0.6744 0.95857
## Specificity 0.8245 0.9697 0.80570
## Pos Pred Value 0.9186 0.8799 0.12571
## Neg Pred Value 0.5605 0.9006 0.99850
## Prevalence 0.7243 0.2474 0.02832
## Detection Rate 0.5461 0.1668 0.02715
## Detection Prevalence 0.5945 0.1896 0.21595
## Balanced Accuracy 0.7892 0.8221 0.88213
The results of the naive Bayes model are the worst so far, with an accuracy of 74.01% and a Kappa value of 0.4966. The last model that we will try is a random forest model.
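Note that fitting a random forest on roughly 75,000 training rows with caret’s default 25 bootstrap resamples can take quite a while. If training time is a concern, a lighter resampling scheme can be specified through trainControl; the sketch below shows 5-fold cross-validation, although the model that follows simply uses the defaults.
## A faster alternative (not used below): 5-fold cross-validation
## instead of the default bootstrap resampling
ctrl <- trainControl(method = "cv", number = 5)
modelrf_cv <- train(class ~ ., data = training, method = "rf",
                    trControl = ctrl, tuneLength = 3)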
modelrf <- train(class ~ ., data = training, method = "rf")
confusionMatrix(testing$class, predict(modelrf, newdata = testing))
## Confusion Matrix and Statistics
##
## Reference
## Prediction GALAXY QSO STAR
## GALAXY 14653 188 20
## QSO 320 4420 0
## STAR 3 0 5395
##
## Overall Statistics
##
## Accuracy : 0.9788
## 95% CI : (0.9769, 0.9805)
## No Information Rate : 0.5991
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9622
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: GALAXY Class: QSO Class: STAR
## Sensitivity 0.9784 0.9592 0.9963
## Specificity 0.9792 0.9843 0.9998
## Pos Pred Value 0.9860 0.9325 0.9994
## Neg Pred Value 0.9681 0.9907 0.9990
## Prevalence 0.5991 0.1843 0.2166
## Detection Rate 0.5861 0.1768 0.2158
## Detection Prevalence 0.5945 0.1896 0.2159
## Balanced Accuracy 0.9788 0.9718 0.9981
The random forest model gives the best results of all the classification models we tried, with an accuracy of 97.88% and a Kappa value of 0.9622.
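Since all four models were trained with caret’s default bootstrap resampling, their resampling results can also be compared side by side with caret’s resamples() helper. This is an optional summary, shown here as a sketch:
## Compare the resampling accuracy and Kappa across the four models
results <- resamples(list(rpart = modelrpart, lda = modellda,
                          nb = modelnb, rf = modelrf))
summary(results)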
Of all the machine learning methods we used, the random forest was the best model for classifying the stellar objects in this dataset. The results on the testing set suggest it is a reliable model, and we expect the out-of-sample error to be around 2%.
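The expected out-of-sample error can be read straight off the test-set confusion matrix as one minus the accuracy; a small sketch:
## Estimated out-of-sample error for the random forest model
cmrf <- confusionMatrix(testing$class, predict(modelrf, newdata = testing))
1 - as.numeric(cmrf$overall["Accuracy"])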