In this project, I’m going to look at the Stellar Classification dataset and build a machine learning model that can predict the class of each object in it. The dataset is available on Kaggle at the following URL:
https://www.kaggle.com/datasets/fedesoriano/stellar-classification-dataset-sdss17
It consists of 100,000 observations of deep-sky objects. I’ll be using the caret package to build the classification models. Let’s load the required libraries, then load the dataset and take a look at its columns.
## Load the required libraries
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
library(caret)
## Loading required package: lattice
## Load the stellar_classification.csv
df <- read.csv("stellar_classification.csv")
## Take a look at the columns of the dataset
names(df)
## [1] "obj_ID" "alpha" "delta" "u" "g"
## [6] "r" "i" "z" "run_ID" "rerun_ID"
## [11] "cam_col" "field_ID" "spec_obj_ID" "class" "redshift"
## [16] "plate" "MJD" "fiber_ID"
We’re really only concerned with seven of the columns in the dataset: the five photometric filter magnitudes (u, g, r, i and z), the redshift, and the class. A full description of every column can be found at the Kaggle link above.
The class is what we are trying to predict. I’m going to use the dplyr package to select only the columns that we need, then make sure that the class column is a factor so that it can be used for classification.
stellarClean <- select(df, u, g, r, i, z, redshift, class)
stellarClean$class <- factor(stellarClean$class)
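Before splitting the data, a quick sanity check on the cleaned data frame can be useful. This step isn’t strictly necessary; it’s just a sketch to confirm the class balance and that there are no missing values.
## Check the class balance and look for missing values
table(stellarClean$class)
colSums(is.na(stellarClean))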
In this section, we’re going to set the seed value for reproducibility and split the data into a training set and a testing set.
## Set the seed value
set.seed(3845)
## Split the stellarClean set into a training and testing set.
inTrain <- createDataPartition(stellarClean$class, p = 0.75, list = FALSE)
training <- stellarClean[inTrain,]
testing <- stellarClean[-inTrain,]
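If you want to confirm the split behaved as expected, something like the following (an optional check, not required for the modelling) shows the number of rows in each set and that createDataPartition preserved the class proportions:
## Check the size of each set and the class proportions
nrow(training)
nrow(testing)
prop.table(table(training$class))
prop.table(table(testing$class))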
We’re going to train a few different models and compare their accuracy. The first model we’ll create is a recursive partitioning (rpart) model, using all of the available columns in the training set as predictors. Once the model is trained, we’ll check its accuracy by looking at the confusion matrix on the testing set.
modelrpart <- train(class ~ ., data = training, method = "rpart")
confusionMatrix(testing$class, predict(modelrpart, newdata = testing))
## Confusion Matrix and Statistics
##
## Reference
## Prediction GALAXY QSO STAR
## GALAXY 10855 184 107
## QSO 690 2863 2
## STAR 0 0 4049
##
## Overall Statistics
##
## Accuracy : 0.9476
## 95% CI : (0.9443, 0.9507)
## No Information Rate : 0.6157
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9056
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Statistics by Class:
##
## Class: GALAXY Class: QSO Class: STAR
## Sensitivity 0.9402 0.9396 0.9738
## Specificity 0.9596 0.9559 1.0000
## Pos Pred Value 0.9739 0.8053 1.0000
## Neg Pred Value 0.9093 0.9879 0.9926
## Prevalence 0.6157 0.1625 0.2218
## Detection Rate 0.5789 0.1527 0.2159
## Detection Prevalence 0.5945 0.1896 0.2159
## Balanced Accuracy 0.9499 0.9478 0.9869
This model has an accuracy of 94.76% with a Kappa value of 0.9056. That’s fairly accurate, but other models may perform better.
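If you’d like to see what the fitted tree actually looks like, caret stores the underlying rpart object in the finalModel slot. Here is a minimal sketch, assuming the rpart.plot package is installed:
## Plot the final decision tree from the caret model object
library(rpart.plot)
rpart.plot(modelrpart$finalModel)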
Next, we will try a linear discriminant analysis model and see if we can improve the accuracy.
modellda <- train(class ~ ., data = training, method = "lda")
confusionMatrix(testing$class, predict(modellda, newdata = testing))
## Confusion Matrix and Statistics
##
## Reference
## Prediction GALAXY QSO STAR
## GALAXY 10570 137 439
## QSO 566 2942 47
## STAR 1704 2 2343
##
## Overall Statistics
##
## Accuracy : 0.8456
## 95% CI : (0.8403, 0.8507)
## No Information Rate : 0.6848
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.7082
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Statistics by Class:
##
## Class: GALAXY Class: QSO Class: STAR
## Sensitivity 0.8232 0.9549 0.8282
## Specificity 0.9025 0.9609 0.8928
## Pos Pred Value 0.9483 0.8276 0.5787
## Neg Pred Value 0.7015 0.9909 0.9669
## Prevalence 0.6848 0.1643 0.1509
## Detection Rate 0.5637 0.1569 0.1250
## Detection Prevalence 0.5945 0.1896 0.2159
## Balanced Accuracy 0.8629 0.9579 0.8605
The LDA model doesn’t perform as well as the recursive partitioning model does. It has an accuracy of 84.56% and a Kappa value of 0.7082. We’ll try a naive Bayes model and see if we get better results.
modelnb <- train(class ~ ., data = training, method = "naive_bayes")
confusionMatrix(testing$class, predict(modelnb, newdata = testing))
## Confusion Matrix and Statistics
##
## Reference
## Prediction GALAXY QSO STAR
## GALAXY 10239 898 9
## QSO 414 3128 13
## STAR 2928 612 509
##
## Overall Statistics
##
## Accuracy : 0.7401
## 95% CI : (0.7337, 0.7463)
## No Information Rate : 0.7243
## P-Value [Acc > NIR] : 6.459e-07
##
## Kappa : 0.4966
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Statistics by Class:
##
## Class: GALAXY Class: QSO Class: STAR
## Sensitivity 0.7539 0.6744 0.95857
## Specificity 0.8245 0.9697 0.80570
## Pos Pred Value 0.9186 0.8799 0.12571
## Neg Pred Value 0.5605 0.9006 0.99850
## Prevalence 0.7243 0.2474 0.02832
## Detection Rate 0.5461 0.1668 0.02715
## Detection Prevalence 0.5945 0.1896 0.21595
## Balanced Accuracy 0.7892 0.8221 0.88213
The results of the naive Bayes model are the worst so far, with an accuracy of 74.01% and a Kappa value of 0.4966. The last model that we will try is a random forest model.
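Note that fitting a random forest on roughly 75,000 training rows with caret’s default 25 bootstrap resamples can take quite a while. If training time is a concern, a lighter resampling scheme can be specified through trainControl; the sketch below shows 5-fold cross-validation, although the model that follows simply uses the defaults.
## A faster alternative (not used below): 5-fold cross-validation
## instead of the default bootstrap resampling
ctrl <- trainControl(method = "cv", number = 5)
modelrf_cv <- train(class ~ ., data = training, method = "rf",
                    trControl = ctrl, tuneLength = 3)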
modelrf <- train(class ~ ., data = training, method = "rf")
confusionMatrix(testing$class, predict(modelrf, newdata = testing))
## Confusion Matrix and Statistics
##
## Reference
## Prediction GALAXY QSO STAR
## GALAXY 14653 188 20
## QSO 320 4420 0
## STAR 3 0 5395
##
## Overall Statistics
##
## Accuracy : 0.9788
## 95% CI : (0.9769, 0.9805)
## No Information Rate : 0.5991
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9622
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: GALAXY Class: QSO Class: STAR
## Sensitivity 0.9784 0.9592 0.9963
## Specificity 0.9792 0.9843 0.9998
## Pos Pred Value 0.9860 0.9325 0.9994
## Neg Pred Value 0.9681 0.9907 0.9990
## Prevalence 0.5991 0.1843 0.2166
## Detection Rate 0.5861 0.1768 0.2158
## Detection Prevalence 0.5945 0.1896 0.2159
## Balanced Accuracy 0.9788 0.9718 0.9981
The random forest model gives the best results of all the classification models we tried, with an accuracy of 97.88% and a Kappa value of 0.9622.
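Since all four models were trained with caret’s default bootstrap resampling, their resampling results can also be compared side by side with caret’s resamples() helper. This is an optional summary, shown here as a sketch:
## Compare the resampling accuracy and Kappa across the four models
results <- resamples(list(rpart = modelrpart, lda = modellda,
                          nb = modelnb, rf = modelrf))
summary(results)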
Of all the machine learning methods we used, the random forest was the best model for classifying the stellar objects in this dataset. The results on the testing set suggest it is a reliable model, and we expect the out-of-sample error to be around 2%.
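The expected out-of-sample error can be read straight off the test-set confusion matrix as one minus the accuracy; a small sketch:
## Estimated out-of-sample error for the random forest model
cmrf <- confusionMatrix(testing$class, predict(modelrf, newdata = testing))
1 - as.numeric(cmrf$overall["Accuracy"])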