Classification of Galaxies, Stars and Quasars based on the RD14 from the SDSS

In this project we will try to classify observations of the space to either stars, galaxies or quasars. We will discuss the complete cycle of TDSP model workflow for this project.

It is the data from Slogan sky Digital Survey. It offers the public data of space observations. 2.5 m diameter telescope was built at the Apache Point Observatory in New Mexico for this observation. The telescope uses a camera of 30 CCD-Chips with 2048x2048 image points each. The chips are ordered in 5 rows with 6 chips in each row. Each row observes the space through different optical filters (u, g, r, i, z) at wavelengths of approximately 354, 476, 628, 769, 925 nm.

The observation focuses on northern part of the sky.

Data acquisition

The data of this survey is publicly available and can be accessed through multiple ways.We can directly download data from their official site https://www.sdss.org/. But we decided to download the data from Kaggle as the subsetted data was already available there.

Data Understanding

Feature Description

View “PhotoObj” objid = Object Identifier ra = J2000 Right Ascension (r-band) dec = J2000 Declination (r-band)

Right ascension (abbreviated RA) is the angular distance measured eastward along the celestial equator from the Sun at the March equinox to the hour circle of the point above the earth in question. When paired with declination (abbreviated dec), these astronomical coordinates specify the direction of a point on the celestial sphere (traditionally called in English the skies or the sky) in the equatorial coordinate system.

Source: https://en.wikipedia.org/wiki/Right_ascension

u = better of DeV/Exp magnitude fit g = better of DeV/Exp magnitude fit r = better of DeV/Exp magnitude fit i = better of DeV/Exp magnitude fit z = better of DeV/Exp magnitude fit The Thuan-Gunn astronomic magnitude system. u, g, r, i, z represent the response of the 5 bands of the telescope.

Further education: https://www.astro.umd.edu/~ssm/ASTR620/mags.html

run = Run Number rereun = Rerun Number camcol = Camera column field = Field number Run, rerun, camcol and field are features which describe a field within an image taken by the SDSS. A field is basically a part of the entire image corresponding to 2048 by 1489 pixels. A field can be identified by: run number, which identifies the specific scan, the camera column, or “camcol,” a number from 1 to 6, identifying the scanline within the run, and the field number. The field number typically starts at 11 (after an initial rampup time), and can be as large as 800 for particularly long runs. An additional number, rerun, specifies how the image was processed. View “SpecObj” specobjid = Object Identifier class = object class (galaxy, star or quasar object) The class identifies an object to be either a galaxy, star or quasar. This will be the response variable which we will be trying to predict.

redshift = Final Redshift plate = plate number mjd = MJD of observation fiberid = fiber ID In physics, redshift happens when light or other electromagnetic radiation from an object is increased in wavelength, or shifted to the red end of the spectrum.

Each spectroscopic exposure employs a large, thin, circular metal plate that positions optical fibers via holes drilled at the locations of the images in the telescope focal plane. These fibers then feed into the spectrographs. Each plate has a unique serial number, which is called plate in views such as SpecObj in the CAS.

Modified Julian Date, used to indicate the date that a given piece of SDSS data (image or spectrum) was taken.

The SDSS spectrograph uses optical fibers to direct the light at the focal plane from individual objects to the slithead. Each object is assigned a corresponding fiberID.

#Importing the data
library(readr)
sdss_data_original <- read_csv("E:/Data Science/Statistics with R/Data files/Skyserver_SQL2_27_2018 6_51_39 PM.csv")
sdss_data <- data.frame(sdss_data_original)
head(sdss_data)

#Importing the libraries
library(xgboost)
library(dplyr)
library(ggplot2)

Data exploration

str(sdss_data)

## 'data.frame':    10000 obs. of  18 variables:
##  $ objid    : num  1.24e+18 1.24e+18 1.24e+18 1.24e+18 1.24e+18 ...
##  $ ra       : num  184 184 184 184 184 ...
##  $ dec      : num  0.0897 0.1353 0.1262 0.0499 0.1026 ...
##  $ u        : num  19.5 18.7 19.4 17.8 17.6 ...
##  $ g        : num  17 17.2 18.2 16.6 16.3 ...
##  $ r        : num  15.9 16.7 17.5 16.2 16.4 ...
##  $ i        : num  15.5 16.5 17.1 16 16.6 ...
##  $ z        : num  15.2 16.4 16.8 15.9 16.6 ...
##  $ run      : num  752 752 752 752 752 752 752 752 752 752 ...
##  $ rerun    : num  301 301 301 301 301 301 301 301 301 301 ...
##  $ camcol   : num  4 4 4 4 4 4 4 4 4 4 ...
##  $ field    : num  267 267 268 269 269 269 269 269 270 270 ...
##  $ specobjid: num  3.72e+18 3.64e+17 3.23e+17 3.72e+18 3.72e+18 ...
##  $ class    : chr  "STAR" "STAR" "GALAXY" "STAR" ...
##  $ redshift : num  -8.96e-06 -5.49e-05 1.23e-01 -1.11e-04 5.90e-04 ...
##  $ plate    : num  3306 323 287 3306 3306 ...
##  $ mjd      : num  54922 51615 52023 54922 54922 ...
##  $ fiberid  : num  491 541 513 510 512 594 559 515 595 400 ...

sdss_data$class <- as.factor(sdss_data$class)
table(sdss_data$class)

## 
## GALAXY    QSO   STAR 
##   4998    850   4152

missing_data <- sdss_data %>% filter_all(any_vars(is.na(.)))

Data has 10000 observations with 18 variables. All variables have numeric data except class. The class variable has character data type. And there is no missing datas in our data. Among them, almost around 50 percent (4998) rows are classified as galaxies. 4152 rows are classified as stars and only 850 rows as quasars.

Data Cleaning

objid and specobjid are just identifiers for accessing the rows back when they were stored in the original database. Therefore we will not need them for classification as they are not related to the outcome.

The features ‘run’, ‘rerun’, ‘camcol’ and ‘field’ are values which describe parts of the camera at the moment when making the observation, e.g. ‘run’ represents the corresponding scan which captured the oject.

So we can drom these variables from our data.

sdss_data <- select (sdss_data,-c('objid', 'run', 'rerun', 'camcol', 'field', 'specobjid'))

Univariate Anaysis

RedShift

To start the univariate analysis we will plot histograms for the ‘redshift’ feature column for each class.

This will tell us how the redshift values are distributed over their range.

sdss_data %>% 
ggplot(aes(x = redshift)) +geom_histogram(bins = 50, col = "steelblue") +
  facet_wrap(~class, scales = "free")

This result is quite significant.

We can cleary tell that the redshift values for the classes seem different for galaxies, qasars and stars.

Star: The histogram looks like a zero-centered normal distribution.

Galaxy: The redshift values may come from a slightly right-shifted normal distribution which is centered around 0.075.

QSO: The redshift values for QSOs are a lot more uniformly distributed than for Stars or Galaxies. They are roughly evenly distributed from 0 to 3, than the occurences decrease drastically. For 4 oder ~5.5 there are some outliers.

The red shift value can be used to estimate the distance of these objects from the earth.

The histogram plot tells us that moth of the stars observed are closer to eath than the galaxies and qasars. Galaxies seem to be little far away and quasars are the most distant objects.

The reason we are observing these galaxies and quasars might be due to their size and physical structure, they can be observed from further away than “small” stars. We can draw so many information about these classes from just the redshift variabe. So it is one of the most important variale of our data while classifying new objects.

dec(position of the celectial equador)

Lets plor the class variablea against the dec wich gives us the idea about where these objects are located along the celestial longitude.

library(lvplot)
sdss_data %>% ggplot(aes(y=dec,x=class)) + geom_lv(mapping = NULL, stat = "lv", position = "dodge",
 outlier.shape = 19, outlier.size = 1.5,
  outlier.stroke = 0.5, varwidth = FALSE,
  width.method = "linear", show.legend = NA, inherit.aes = TRUE) + 
  facet_wrap(facets=~class, scales="free")  +
  xlab("class") +
  ylab("dec") +
  ggtitle("dec")

Interpretation:

The Letter value Plot gives the estimation of the distribution of the data. It shows boxes which relate to the amount of values within the range of values inside the box. From the plot, we can observe the clear distinction between stars and other objects( galaxis and quasars). On the other hand there are similarities between galazies and qasars plot.

Star: Most of the data points lay within a 0 to 10 range. Another large part consists of values between about 10 to 55. Only small amounts of the data are lower or higher than these ranges.

Galaxy: Most of the values lays between 0 and 45. There is a smaller amount of values in the range of 45 to 60. The rest of the data has smaller or higher values.

QSO: This plot looks quite similiar to the GALAXY plot. Only the amount of data points in the range of 0 to 60 is even bigger.

From this we might conclude that most of the quasars and galaxies can be found at almost similar positions in the celestial sphare.But we can not say it for sure looking only at this plot, so we will have to do further analysis.

Multivariate Analysis

u,g,r,i,z filters (different wavelength used to capture the objects)

Now we are going the check the correlation between these variable

library(reshape2)
#creating the data fram of u, g,r,i ,z
filter(sdss_data_original, class == 'STAR') %>% select( c(4,5,6,7,8)) -> mydata_star 
filter(sdss_data_original, class == 'GALAXY') %>% select( c(4,5,6,7,8)) -> mydata_galaxy
filter(sdss_data_original, class == 'QSO') %>% select( c(4,5,6,7,8)) -> mydata_qso

#creating correlation matrix
cormat_star <- round(cor(mydata_star),2)
cormat_galaxy <- round(cor(mydata_galaxy),2)
cormat_qso <- round(cor(mydata_qso),2)
head(cormat_star)

##      u    g    r    i    z
## u 1.00 0.91 0.79 0.71 0.71
## g 0.91 1.00 0.97 0.92 0.92
## r 0.79 0.97 1.00 0.97 0.98
## i 0.71 0.92 0.97 1.00 0.97
## z 0.71 0.92 0.98 0.97 1.00

head(cormat_galaxy)

##      u    g    r    i    z
## u 1.00 0.88 0.73 0.68 0.62
## g 0.88 1.00 0.96 0.93 0.91
## r 0.73 0.96 1.00 0.99 0.98
## i 0.68 0.93 0.99 1.00 0.99
## z 0.62 0.91 0.98 0.99 1.00

head(cormat_qso)

##      u    g    r    i    z
## u 1.00 0.90 0.80 0.73 0.68
## g 0.90 1.00 0.95 0.90 0.87
## r 0.80 0.95 1.00 0.98 0.95
## i 0.73 0.90 0.98 1.00 0.97
## z 0.68 0.87 0.95 0.97 1.00

#Melting the  correlation matrix
melted_cormat_star <- melt(cormat_star)
melted_cormat_galaxy <- melt(cormat_galaxy)
melted_cormat_qso <- melt(cormat_qso)
head(melted_cormat_star)

head(melted_cormat_galaxy)

head(melted_cormat_qso)

#Visualizing the correlation matrix with ggplot2
par(mfrow=c(1,3))
ggplot(data = melted_cormat_star, aes(x=Var1, y=Var2, fill=value)) + geom_tile()+ ggtitle("Star")

ggplot(data = melted_cormat_galaxy, aes(x=Var1, y=Var2, fill=value)) + geom_tile() + ggtitle("Galaxy")

ggplot(data = melted_cormat_qso, aes(x=Var1, y=Var2, fill=value)) + geom_tile()+ ggtitle("QSO")

It is not really surprising for us to see from the plot that there is a high correlation between different bands. Because each band is capturing one or another objects. But we can see there is low correlation of the band u with other bands. u, g, r, i, z capture light at wavelengths of 354, 476, 628, 769 and 925 nm.So, this might mean that the objects shine brighter at elevenths from bang g to z (ie 476 to 925).

But on the other hand the heatmap of all 3 objects looks almost similar. It means that, these different bands behave similar for all these objects.

Right ascension (ra) and declination (dec)

We will now plot ra and dec, depending upon the class.

sdss_data %>% ggplot(aes(ra, dec, colour = class)) + geom_point()

From the plot, there is no way we can distinguish the positiong of these objects. Or we can not classify them based on their position in celestial sphere.

It contradicts our previous conclusion that we had gotten from the lvplot (that we can separate the objects based on their position).

Feature Engineering

u, g, r, i, z

We will now reduce the dimensions by replacing the different bands ‘u’, ‘g’, ‘r’, ‘i’ and ‘z’ by a linear combination with only 3 dimensions using Principal Component Analysis.

sdss_data_1 <- data.frame(sdss_data)
pca <- prcomp(select(sdss_data_1, c(u,g,r,i,z)),
             center = TRUE,
            scale. = TRUE)
#updating the dataframe


sdss_data <- select(sdss_data,!c(u,g,r,i,z))
x <- data.frame(pca$x)
sdss_data <- cbind(sdss_data,select(x, c(PC1,PC2,PC3)))

Machine Learning Models Training

In this section we will be splitting our data into training and testing set and then we will try and test different models. We will use various machine learning algorithms and cross validate the model as well.

Feature Scaling

Scaling all values to be within the (0, 1) interval will reduce the distortion due to exceptionally high values. It will eventually make some models converge faster.

set.seed(123)
sdss_data_scaled <- sdss_data %>% mutate_if(is.numeric, scale)

Splitting the dataset Now we will split the dataset into testing and training set

set.seed(1234)
ind <- sample(2, nrow(sdss_data), replace = TRUE, prob = c(0.7, 0.3))
train <- sdss_data_scaled[ind == 1, ]
test <- sdss_data_scaled[ind == 2, ]

Modeling

K-nearest Neighborhood algorithm

library(caret)

## Loading required package: lattice

library(class)
train_scale <- train[ , -which(names(train) %in% c("class"))]
test_scale <- test[ , -which(names(test) %in% c("class"))]

# Fitting KNN Model 
# to training dataset
set.seed(1234)
#K = 1
classifier_knn <- knn(train = train_scale,
                      test = test_scale,
                      cl = train$class,
                      k = 1)
head(classifier_knn)

## [1] STAR   STAR   QSO    GALAXY STAR   STAR  
## Levels: GALAXY QSO STAR

##create confusion matrix
cm_1 <- table(test$class, classifier_knn)
misClassError <- mean(classifier_knn != test$class)
print(paste('Accuracy =', 1-misClassError))

## [1] "Accuracy = 0.952686718485975"

(cm_1)

##         classifier_knn
##          GALAXY  QSO STAR
##   GALAXY   1407    9   68
##   QSO         6  241    2
##   STAR       49    6 1171

# K = 3
classifier_knn <- knn(train = train_scale,
                      test = test_scale,
                      cl = train$class,
                      k = 3)
misClassError <- mean(classifier_knn != test$class)
print(paste('Accuracy =', 1-misClassError))

## [1] "Accuracy = 0.954038526529233"

# K = 5
classifier_knn <- knn(train = train_scale,
                      test = test_scale,
                      cl = train$class,
                      k = 5)
##create confusion matrix
cm_5 <- table(test$class, classifier_knn)

(cm_5)

##         classifier_knn
##          GALAXY  QSO STAR
##   GALAXY   1418    6   60
##   QSO         7  237    5
##   STAR       40    5 1181

misClassError <- mean(classifier_knn != test$class)
print(paste('Accuracy =', 1-misClassError))

## [1] "Accuracy = 0.958431902669821"

table(sdss_data_original$class)

## 
## GALAXY    QSO   STAR 
##   4998    850   4152

XGBoost

For using xgboost, we need to make data converted to sparse matrix so that it will be compatible to feed into xgb.train method.

library(caret)
library(xgboost)
library(Matrix)

#creating numeric labels 
train_labs <- as.numeric(train$class)-1 
test_labs <- as.numeric(test$class)-1

#creating Matrices
#for training
trainM <- sparse.model.matrix(class ~., data = train)
train_label = train[, "class"]
nrow(train_label)

## NULL

nrow(trainM)

## [1] 7041

train_matrix <- xgb.DMatrix(data = as.matrix(trainM), label = train_labs)

#for testing
testM <- sparse.model.matrix(class~.,-9, data = test)
test_label = test[, "class"]
nrow(test_label)

## NULL

nrow(testM)

## [1] 2959

test_matrix <- xgb.DMatrix(data = as.matrix(testM), label = test_labs)


# Set parameters(default)
params <- list(booster = "gbtree", objective = "multi:softprob", num_class = 3, eval_metric = "mlogloss")

# Calculate # of folds for cross-validation
xgbcv <- xgb.cv(params = params, data = train_matrix, nrounds = 100, nfold = 5, showsd = TRUE, stratified = TRUE, print.every.n = 10, early_stop_round = 20, maximize = FALSE, prediction = TRUE)

## [1]  train-mlogloss:0.708638+0.000242    test-mlogloss:0.710699+0.000741 
## [11] train-mlogloss:0.036226+0.000919    test-mlogloss:0.050232+0.005235 
## [21] train-mlogloss:0.005913+0.000332    test-mlogloss:0.027391+0.006998 
## [31] train-mlogloss:0.002159+0.000148    test-mlogloss:0.027706+0.008495 
## [41] train-mlogloss:0.001331+0.000089    test-mlogloss:0.028764+0.009114 
## [51] train-mlogloss:0.001034+0.000071    test-mlogloss:0.029575+0.009356 
## [61] train-mlogloss:0.000873+0.000051    test-mlogloss:0.029986+0.009605 
## [71] train-mlogloss:0.000776+0.000042    test-mlogloss:0.030296+0.009734 
## [81] train-mlogloss:0.000709+0.000034    test-mlogloss:0.030557+0.010020 
## [91] train-mlogloss:0.000659+0.000031    test-mlogloss:0.030715+0.010171 
## [100]    train-mlogloss:0.000624+0.000028    test-mlogloss:0.030954+0.010323

# Function to compute classification error
classification_error <- function(conf_mat) {
  conf_mat = as.matrix(conf_mat)
  
  error = 1 - sum(diag(conf_mat)) / sum(conf_mat)
  
  return (error)
}


# Mutate xgb output to deliver predictions
xgb_train_preds <- data.frame(xgbcv$pred) %>% mutate(max = max.col(., ties.method = "last"), label = train_labs + 1)


# Examine output
head(xgb_train_preds)

xgb_conf_mat <- confusionMatrix(factor(xgb_train_preds$label),
                                  factor(xgb_train_preds$max),
                                  mode = "everything")

print(xgb_conf_mat)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    1    2    3
##          1 3492   14    8
##          2   19  582    0
##          3    3    0 2923
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9938          
##                  95% CI : (0.9916, 0.9955)
##     No Information Rate : 0.4991          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9891          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: 1 Class: 2 Class: 3
## Sensitivity            0.9937  0.97651   0.9973
## Specificity            0.9938  0.99705   0.9993
## Pos Pred Value         0.9937  0.96839   0.9990
## Neg Pred Value         0.9938  0.99783   0.9981
## Precision              0.9937  0.96839   0.9990
## Recall                 0.9937  0.97651   0.9973
## F1                     0.9937  0.97243   0.9981
## Prevalence             0.4991  0.08465   0.4163
## Detection Rate         0.4960  0.08266   0.4151
## Detection Prevalence   0.4991  0.08536   0.4156
## Balanced Accuracy      0.9938  0.98678   0.9983

The accuracy of the model is 0.9942 0.98545 0.9986 for 3 classes( ie star, galaxy and QSO).

Now lets check the model in the test data set

# Create the model
xgb_model <- xgb.train(params = params, data = train_matrix, nrounds = 100)

# Predict for validation set
xgb_test_preds <- predict(xgb_model, newdata = test_matrix)

xgb_test_out <- matrix(xgb_test_preds, nrow = 3, ncol = length(xgb_test_preds) / 3) %>% 
               t() %>%
               data.frame() %>%
               mutate(max = max.col(., ties.method = "last"), label = test_labs + 1) 

# Confustion Matrix
xgb_test_conf <- table(true = test_labs + 1, pred = xgb_test_out$max)

(xgb_test_conf)

##     pred
## true    1    2    3
##    1 1473   10    1
##    2    6  242    1
##    3    1    0 1225

cat("XGB Validation Classification Error Rate:", classification_error(xgb_test_conf), "\n")

## XGB Validation Classification Error Rate: 0.006421088

cat("XGB Validation Classification Accuracy:", 1 - classification_error(xgb_test_conf), "\n")

## XGB Validation Classification Accuracy: 0.9935789

Support Vector Machine Classifier

library(e1071)  
svm1 <- svm(class~., data=train, 
          method="C-classification", kernal="radial", 
          gamma=0.1, cost=10)
summary(svm1)

## 
## Call:
## svm(formula = class ~ ., data = train, method = "C-classification", 
##     kernal = "radial", gamma = 0.1, cost = 10)
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  radial 
##        cost:  10 
## 
## Number of Support Vectors:  827
## 
##  ( 392 346 89 )
## 
## 
## Number of Classes:  3 
## 
## Levels: 
##  GALAXY QSO STAR

#Predicting the data
prediction_svm <- predict(svm1, test)
conm_svm <- confusionMatrix(test$class, prediction_svm)
print(paste('Accuracy =', conm_svm$overall["Accuracy"]))

## [1] "Accuracy = 0.98107468739439"

(conm_svm)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction GALAXY  QSO STAR
##     GALAXY   1446   10   28
##     QSO         9  239    1
##     STAR        6    2 1218
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9811          
##                  95% CI : (0.9755, 0.9857)
##     No Information Rate : 0.4937          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9668          
##                                           
##  Mcnemar's Test P-Value : 0.002171        
## 
## Statistics by Class:
## 
##                      Class: GALAXY Class: QSO Class: STAR
## Sensitivity                 0.9897    0.95219      0.9767
## Specificity                 0.9746    0.99631      0.9953
## Pos Pred Value              0.9744    0.95984      0.9935
## Neg Pred Value              0.9898    0.99557      0.9833
## Prevalence                  0.4937    0.08483      0.4214
## Detection Rate              0.4887    0.08077      0.4116
## Detection Prevalence        0.5015    0.08415      0.4143
## Balanced Accuracy           0.9822    0.97425      0.9860

Random Forest Classifier

library(rpart)
#Fitting the model 
set.seed(12345)
# Training with classification tree
modfit.rpart <- rpart(class ~ ., data=train, method="class", xval = 4)
print(modfit.rpart, digits = 3)

## n= 7041 
## 
## node), split, n, loss, yval, (yprob)
##       * denotes terminal node
## 
## 1) root 7041 3530 GALAXY (0.499077 0.085357 0.415566)  
##   2) redshift>=-0.366 4102  604 GALAXY (0.852755 0.146514 0.000731)  
##     4) redshift< 0.18 3529   47 GALAXY (0.986682 0.012468 0.000850) *
##     5) redshift>=0.18 573   16 QSO (0.027923 0.972077 0.000000) *
##   3) redshift< -0.366 2939   16 STAR (0.005444 0.000000 0.994556) *

# Predicting the testing set  
predictions_rf <- predict(modfit.rpart, test ,type = "class")


# Accuracy and other metrics
conm_rf <- confusionMatrix(predictions_rf, test$class)
(conm_rf)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction GALAXY  QSO STAR
##     GALAXY   1466   14    4
##     QSO        12  234    0
##     STAR        6    1 1222
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9875          
##                  95% CI : (0.9828, 0.9912)
##     No Information Rate : 0.5015          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.978           
##                                           
##  Mcnemar's Test P-Value : 0.6699          
## 
## Statistics by Class:
## 
##                      Class: GALAXY Class: QSO Class: STAR
## Sensitivity                 0.9879    0.93976      0.9967
## Specificity                 0.9878    0.99557      0.9960
## Pos Pred Value              0.9879    0.95122      0.9943
## Neg Pred Value              0.9878    0.99447      0.9977
## Prevalence                  0.5015    0.08415      0.4143
## Detection Rate              0.4954    0.07908      0.4130
## Detection Prevalence        0.5015    0.08314      0.4153
## Balanced Accuracy           0.9878    0.96767      0.9963

print(paste('Accuracy =', conm_rf$overall["Accuracy"]))

## [1] "Accuracy = 0.987495775599865"

Feature Importance

Decision trees have unique ability of being able to order features by their ability to split between the classes. Now we will analyze the feature importance based on the random forest model.

#Conditional=True, adjusts for correlations between predictors.
i_scores <- varImp(modfit.rpart, conditional=TRUE)
#Gathering rownames in 'var'  and converting it to the factor
#to provide 'fill' parameter for the bar chart. 
i_scores <- i_scores %>% tibble::rownames_to_column("var") 
i_scores$var<- i_scores$var %>% as.factor()
#Plotting the bar and polar charts for comparing variables
i_bar <- ggplot(data = i_scores) + 
  geom_bar(
    stat = "identity",#it leaves the data without count and bin
    mapping = aes(x = var, y=Overall, fill = var), 
    show.legend = FALSE,
    width = 1
  ) + 
  labs(x = NULL, y = NULL)
i_bar + coord_polar() + theme_minimal()

i_bar + coord_flip() + theme_minimal()

From here, we can clearly see that redhift is the most important feature that helped us classifying the data.So this is the best in terms of splitting ability. Apart from that plat and the pcas we made also plays important role in classifying out data. Where as ra and dec had almost no role whatsoever in classyfing the data. It was pretty evedent to us as well as we had already visualized it using scatterplot and most of the data were overlaped. So these are the features with lowest importance.

Summary and SWOT analysis:

We tried different machine learning models to classify the objects(stars, galaxies and quasars) in the given telescope data. We trid Knn classifier, xgboost, support vector machine and random forrest classifies. Accuracy of these models were respectively 0.9587, 0.9927, 0.9810, 0.9874 respectively. So according to this, xgboost model does the best for our data while classifying objects.

Strengths: The data provided to us was already almost clean, so we did not have to waste much time cleaning the data. The data also had no missing values which made the whole process a lot easier. Weakness: The ratio of the data is drastic, meaning some objects are way too many (stars and galaxy), where as the object quasar are only a few in data as compared to stars and galaxy. So our concern is that accuracy only might not be the enough measurement of the ability of the models to correctly classify the data.

Opportunity: THe models we formulated were pretty great while predicting the data with accuracy of morethan 90% for all the models. The xgboost model had the accuracy of 98% which was amazing. So we can use these models in future as well to classify these celestial objects. As there is the data about the camera as well, we can perform analysis on them as well.

Threats: The accuracy of the almost all the models are very high, which is a bit concerning for as as no model is perfect. So, we should furthe make auc, roc graphs to analyze the models before using them in a carefree way as accuracy only might not be the good indicator of the model being good.