Introduction

For this project, I am using a refined version of the MaStar stellar catalogue. Even though the catalogue is refined, there are still tens of variables that we should attempt to relate to each other. Therefore I ask, What are the strongest relations between variables in the stellar catalog?

To answer that question I will pass 24,290 stars through a Principle COmponent Analysis in an attempt to describe a relationship between variables in the stellar catalog.

Data Analysis

Data analysis begins with extracting our dataset from the csv file.

mastar = read_csv("Data/mastarall.csv", show_col_types=FALSE)
summary(mastar) # Summarize the data

##     DRPVER            MPROCVER           MANGAID              PLATE      
##  Length:59266       Length:59266       Length:59266       Min.   : 7443  
##  Class :character   Class :character   Class :character   1st Qu.: 8921  
##  Mode  :character   Mode  :character   Mode  :character   Median : 9800  
##                                                           Mean   :10036  
##                                                           3rd Qu.:11254  
##                                                           Max.   :12772  
##                                                                          
##    IFUDESIGN          MJD            IFURA              IFUDEC       
##  Min.   :  701   Min.   :56739   Min.   :  0.0528   Min.   :-32.666  
##  1st Qu.:  706   1st Qu.:57673   1st Qu.:115.2279   1st Qu.:  7.576  
##  Median :  711   Median :58119   Median :194.3851   Median : 28.365  
##  Mean   : 3599   Mean   :58101   Mean   :186.6236   Mean   : 27.857  
##  3rd Qu.: 6102   3rd Qu.:58582   3rd Qu.:261.0807   3rd Qu.: 47.416  
##  Max.   :12705   Max.   :59085   Max.   :359.9651   Max.   : 87.357  
##                                                                      
##     PSFMAG             MNGTARG2            EXPTIME        NEXP_VISIT    
##  Length:59266       Min.   :      132   Min.   : 10.1   Min.   : 1.000  
##  Class :character   1st Qu.:     1280   1st Qu.:900.1   1st Qu.: 3.000  
##  Mode  :character   Median :  8390656   Median :900.1   Median : 4.000  
##                     Mean   : 27139706   Mean   :777.9   Mean   : 6.727  
##                     3rd Qu.: 67108992   3rd Qu.:900.1   3rd Qu.: 6.000  
##                     Max.   :134299648   Max.   :900.2   Max.   :62.000  
##                                                                         
##     NVELGOOD       HELIOV             VERR            V_ERRCODE
##  Min.   :  0   Min.   :-512.50   Min.   :0.003199   Min.   :0  
##  1st Qu.:  9   1st Qu.: -47.90   1st Qu.:0.677605   1st Qu.:0  
##  Median : 12   Median : -10.03   Median :1.124152   Median :0  
##  Mean   : 16   Mean   : -17.12   Mean   :1.477142   Mean   :0  
##  3rd Qu.: 17   3rd Qu.:  21.87   3rd Qu.:1.860484   3rd Qu.:0  
##  Max.   :135   Max.   : 488.71   Max.   :9.987876   Max.   :0  
##                                                                
##   HELIOV_VISIT        VERR_VISIT       V_ERRCODE_VISIT       VELVARFLAG    
##  Min.   :-520.317   Min.   :  0.0000   Min.   :0.0000000   Min.   :0.0000  
##  1st Qu.: -47.765   1st Qu.:  0.6357   1st Qu.:0.0000000   1st Qu.:0.0000  
##  Median :  -9.952   Median :  1.3248   Median :0.0000000   Median :0.0000  
##  Mean   : -17.002   Mean   :  1.9169   Mean   :0.0005568   Mean   :0.2874  
##  3rd Qu.:  22.202   3rd Qu.:  2.4725   3rd Qu.:0.0000000   3rd Qu.:1.0000  
##  Max.   : 489.544   Max.   :507.9113   Max.   :1.0000000   Max.   :1.0000  
##                                                                            
##    DV_MAXSIG           MJDQUAL         BPRPERR_SP           NEXP_USED     
##  Min.   :  0.0000   Min.   :   0.0   Min.   :-999.00000   Min.   : 1.000  
##  1st Qu.:  0.4372   1st Qu.:   0.0   1st Qu.:   0.00430   1st Qu.: 3.000  
##  Median :  1.5599   Median :   0.0   Median :   0.00813   Median : 4.000  
##  Mean   :  3.1547   Mean   : 624.9   Mean   : -41.07281   Mean   : 6.693  
##  3rd Qu.:  3.3920   3rd Qu.:1024.0   3rd Qu.:   0.01516   3rd Qu.: 6.000  
##  Max.   :169.6528   Max.   :3328.0   Max.   :   0.05000   Max.   :62.000  
##                                      NA's   :8                            
##       S2N               S2N10           BADPIXFRAC             RA          
##  Min.   :   2.814   Min.   :  34.08   Min.   :0.000000   Min.   :  0.0528  
##  1st Qu.:  63.146   1st Qu.:  83.80   1st Qu.:0.000000   1st Qu.:115.2279  
##  Median :  95.904   Median : 126.89   Median :0.001096   Median :194.3851  
##  Mean   : 126.010   Mean   : 165.48   Mean   :0.001126   Mean   :186.6236  
##  3rd Qu.: 159.397   3rd Qu.: 208.42   3rd Qu.:0.001534   3rd Qu.:261.0808  
##  Max.   :1024.016   Max.   :1386.06   Max.   :0.019724   Max.   :359.9651  
##                                                                            
##       DEC              EPOCH      COORD_SOURCE         PHOTOCAT        
##  Min.   :-32.666   Min.   :1999   Length:59266       Length:59266      
##  1st Qu.:  7.576   1st Qu.:2012   Class :character   Class :character  
##  Median : 28.365   Median :2012   Mode  :character   Mode  :character  
##  Mean   : 27.857   Mean   :2010                                        
##  3rd Qu.: 47.416   3rd Qu.:2012                                        
##  Max.   : 87.357   Max.   :2016                                        
##

head(mastar) # Get the first lines of the data

## # A tibble: 6 × 32
##   DRPVER MPROCVER MANGAID PLATE IFUDESIGN   MJD IFURA IFUDEC PSFMAG     MNGTARG2
##   <chr>  <chr>    <chr>   <dbl>     <dbl> <dbl> <dbl>  <dbl> <chr>         <dbl>
## 1 v3_1_1 v1_7_7   5-66031 10001       701 57372  134.   56.4 [17.9317 …  8390656
## 2 v3_1_1 v1_7_7   5-66031 10001       701 57373  134.   56.4 [17.9317 …  8390656
## 3 v3_1_1 v1_7_7   5-12626 10001       702 57372  136.   57.6 [17.3449 …  8390656
## 4 v3_1_1 v1_7_7   5-12626 10001       702 57373  136.   57.6 [17.3449 …  8390656
## 5 v3_1_1 v1_7_7   5-66039 10001       703 57372  134.   57.9 [17.4606 …  8390656
## 6 v3_1_1 v1_7_7   5-66039 10001       703 57373  134.   57.9 [17.4606 …  8390656
## # ℹ 22 more variables: EXPTIME <dbl>, NEXP_VISIT <dbl>, NVELGOOD <dbl>,
## #   HELIOV <dbl>, VERR <dbl>, V_ERRCODE <dbl>, HELIOV_VISIT <dbl>,
## #   VERR_VISIT <dbl>, V_ERRCODE_VISIT <dbl>, VELVARFLAG <dbl>, DV_MAXSIG <dbl>,
## #   MJDQUAL <dbl>, BPRPERR_SP <dbl>, NEXP_USED <dbl>, S2N <dbl>, S2N10 <dbl>,
## #   BADPIXFRAC <dbl>, RA <dbl>, DEC <dbl>, EPOCH <dbl>, COORD_SOURCE <chr>,
## #   PHOTOCAT <chr>

We can see that there are a total of 32 columns in this dataset and a total of 59,266 observations. This data will later be passed into the PCA algorithm however we need to clean it first. To clean the data I will maintain the ID of each star but remove all other identification and character based columns.

mastar_refined <- mastar |> 
  select(c(MANGAID, where(is.numeric)))

mastar_refined

## # A tibble: 59,266 × 27
##    MANGAID  PLATE IFUDESIGN   MJD IFURA IFUDEC MNGTARG2 EXPTIME NEXP_VISIT
##    <chr>    <dbl>     <dbl> <dbl> <dbl>  <dbl>    <dbl>   <dbl>      <dbl>
##  1 5-66031  10001       701 57372  134.   56.4  8390656    900.          3
##  2 5-66031  10001       701 57373  134.   56.4  8390656    900.          6
##  3 5-12626  10001       702 57372  136.   57.6  8390656    900.          3
##  4 5-12626  10001       702 57373  136.   57.6  8390656    900.          6
##  5 5-66039  10001       703 57372  134.   57.9  8390656    900.          3
##  6 5-66039  10001       703 57373  134.   57.9  8390656    900.          6
##  7 5-108715 10001       704 57372  133.   58.0  8390656    900.          3
##  8 5-108715 10001       704 57373  133.   58.0  8390656    900.          6
##  9 5-108718 10001       705 57372  133.   58.3  8390656    900.          3
## 10 5-108718 10001       705 57373  133.   58.3  8390656    900.          6
## # ℹ 59,256 more rows
## # ℹ 18 more variables: NVELGOOD <dbl>, HELIOV <dbl>, VERR <dbl>,
## #   V_ERRCODE <dbl>, HELIOV_VISIT <dbl>, VERR_VISIT <dbl>,
## #   V_ERRCODE_VISIT <dbl>, VELVARFLAG <dbl>, DV_MAXSIG <dbl>, MJDQUAL <dbl>,
## #   BPRPERR_SP <dbl>, NEXP_USED <dbl>, S2N <dbl>, S2N10 <dbl>,
## #   BADPIXFRAC <dbl>, RA <dbl>, DEC <dbl>, EPOCH <dbl>

These are all the variables we need for the analysis. So we can go into some further digging around the data. Starting with investigating some of the columns. Specifically investigating the epxosure columns and deciding to omit them from the PCA.

head(mastar_refined$EXPTIME)

## [1] 900.080 900.095 900.080 900.095 900.080 900.095

head(mastar_refined$NEXP_VISIT)

## [1] 3 6 3 6 3 6

From this we can see the completely linear relationship, this means the number of exposures and the exposure time are constatnt or relatively constant over the survey. In fact, NEXP_VISIT will always be 3 or 6, whereas the EXPTIME will always be 900.080 or 900.095. We should drop this from PCA since there will be a 1:1 relationship and that provides no valuable data.

Finally impute all 8 NAs with the mean for the variable:

# Fixing NA Values
sum(is.na(mastar_refined))

## [1] 8

mastar_refined <- mastar_refined |> 
  mutate(across(where(is.numeric),
                ~ ifelse(is.na(.x), median(.x, na.rm = TRUE), .x)))

sum(is.na(mastar_refined))

## [1] 0

Statistical Analysis

We have established a basis for our statistical analysis so let’s pursue PCA now.

# Standardizing
scaled_mastar <- mastar_refined |> 
  select(c(MJD, MNGTARG2, HELIOV, VERR_VISIT, HELIOV_VISIT, DV_MAXSIG, BPRPERR_SP, S2N,, S2N10, BADPIXFRAC)) |> 
  na.omit(scaled_mastar) |>
  scale()

# PCA
mastar_pca <- prcomp(scaled_mastar)

mastar_pca

## Standard deviations (1, .., p=10):
##  [1] 1.51711520 1.40641821 1.24509306 1.03696799 0.96968571 0.95350936
##  [7] 0.88025014 0.63148131 0.26477941 0.04002956
## 
## Rotation (n x k) = (10 x 10):
##                     PC1         PC2         PC3         PC4          PC5
## MJD           0.2293771  0.10998527 -0.06602767  0.63794554 -0.192172053
## MNGTARG2      0.2105345  0.01922726 -0.09015483  0.68248226  0.308291218
## HELIOV        0.2760963 -0.63036810  0.15159638 -0.01543294  0.005544804
## VERR_VISIT   -0.2056607 -0.05229600  0.21935119  0.16437984 -0.457768992
## HELIOV_VISIT  0.2753590 -0.63054239  0.15287451 -0.01364212  0.001927515
## DV_MAXSIG     0.1836396  0.05635443 -0.19484361 -0.15622974  0.698364613
## BPRPERR_SP    0.1720694  0.18756744  0.64529285 -0.01290407  0.131572105
## S2N           0.5633414  0.21478127 -0.13608558 -0.17276652 -0.228519886
## S2N10         0.5622895  0.20784001 -0.12890473 -0.21300968 -0.277613523
## BADPIXFRAC   -0.1088982 -0.24932953 -0.63525021  0.00167355 -0.155320828
##                        PC6         PC7         PC8          PC9          PC10
## MJD           0.1108268285 -0.68529333  0.03734641 -0.060998071  0.0002589282
## MNGTARG2     -0.1201498314  0.60139605 -0.03211092  0.095956348  0.0004478476
## HELIOV        0.0009943688 -0.04026698 -0.03822525 -0.007559279  0.7071179441
## VERR_VISIT    0.7597434873  0.30378035 -0.02535220  0.008257156  0.0087805167
## HELIOV_VISIT  0.0127261767 -0.03673493 -0.03964846 -0.007144394 -0.7070352273
## DV_MAXSIG     0.6247945852 -0.15005349  0.01061446 -0.003517369  0.0025338098
## BPRPERR_SP   -0.0257699356  0.01185455  0.70741251 -0.009607736 -0.0007814327
## S2N           0.0307115919  0.20444986 -0.04231485 -0.700994125  0.0005249470
## S2N10         0.0510538352  0.05827503 -0.01610205  0.703783928  0.0007968765
## BADPIXFRAC    0.0364178802  0.07476738  0.70093030 -0.009272971 -0.0003065024

We have gotten 10 PCs as a response from the program. Here are the variance explained percentages for the components:

library(broom)
mastar_pca |> 
  tidy(matrix="eigenvalues")

## # A tibble: 10 × 4
##       PC std.dev percent cumulative
##    <dbl>   <dbl>   <dbl>      <dbl>
##  1     1  1.52   0.230        0.230
##  2     2  1.41   0.198        0.428
##  3     3  1.25   0.155        0.583
##  4     4  1.04   0.108        0.691
##  5     5  0.970  0.0940       0.785
##  6     6  0.954  0.0909       0.875
##  7     7  0.880  0.0775       0.953
##  8     8  0.631  0.0399       0.993
##  9     9  0.265  0.00701      1.000
## 10    10  0.0400 0.00016      1

The first two components account for 42% of the variance in the dataset. Next let’s graph these two components to see a relationship between the two and better understand the data. We know the following about PC1 and PC2:

PC1 is a relationship between the signal to noise ratio (0.56) and the -1 sigma uncertainty in the median heliocentric velocity (-0.2)

PC2 is a relationship between signal to noise ratio (0.2) to the heliocentric velocity (-0.63)

Graphing:

mastar_pca |>
  # Add PCs to the original dataset
  augment(mastar_refined) |>
  ggplot(aes(.fittedPC1, .fittedPC2)) +
  geom_point(aes()) +

  # Add the PCA1 and PCA2 arrows
  geom_segment(aes(x = -4.9, y = 0, xend = 5, yend = 0), 
               arrow = arrow(type = "closed", length = unit(0.1, "inches")),
               color = "black") +  # PC1 arrow
  geom_segment(aes(x = 0, y = -2.5, xend = 0, yend = 4), 
               arrow = arrow(type = "closed", length = unit(0.1, "inches")),
               color = "black") +  # PC2 arrow

  # Add text labels
  geom_text(aes(x = 5, y = 0, label = "PC1"), 
            vjust = -0.5, color = "black") +  

  geom_text(aes(x = 0, y = 4, label = "PC2"), 
            hjust = -0.5, color = "black") +  

  # Customize the plot
  xlab("PC1") +
  ylab("PC2") +
  theme_minimal()

## Warning in geom_segment(aes(x = -4.9, y = 0, xend = 5, yend = 0), arrow = arrow(type = "closed", : All aesthetics have length 1, but the data has 59266 rows.
## ℹ Please consider using `annotate()` or provide this layer with data containing
##   a single row.

## Warning in geom_segment(aes(x = 0, y = -2.5, xend = 0, yend = 4), arrow = arrow(type = "closed", : All aesthetics have length 1, but the data has 59266 rows.
## ℹ Please consider using `annotate()` or provide this layer with data containing
##   a single row.

## Warning in geom_text(aes(x = 5, y = 0, label = "PC1"), vjust = -0.5, color = "black"): All aesthetics have length 1, but the data has 59266 rows.
## ℹ Please consider using `annotate()` or provide this layer with data containing
##   a single row.

## Warning in geom_text(aes(x = 0, y = 4, label = "PC2"), hjust = -0.5, color = "black"): All aesthetics have length 1, but the data has 59266 rows.
## ℹ Please consider using `annotate()` or provide this layer with data containing
##   a single row.

The returned graph is borderline unreadable with the quantity of data, so let’s plot the rotation matrix:

#Rotation Matrix

arrow_style <- arrow(
  angle = 20, length = grid::unit(8, "pt"),
  ends = "first", type = "closed"
)
mastar_pca |>
  # extract rotation matrix
  tidy(matrix = "rotation") |>
  pivot_wider(
    names_from = "PC", values_from = "value",
    names_prefix = "PC"
  ) |>
  ggplot(aes(PC1, PC2)) +
  geom_segment(
    xend = 0, yend = 0,
    arrow = arrow_style
  ) +
  geom_text(aes(label = column), hjust = 1) +
  xlim(-1, 0.5) + ylim(-1, 0.7) + 
  coord_fixed()+
  theme_minimal()

## Warning: Removed 2 rows containing missing values or values outside the scale range
## (`geom_segment()`).

## Warning: Removed 2 rows containing missing values or values outside the scale range
## (`geom_text()`).

Conclusion and Future Direction

From a principle component analysis of the data we know that there is a very complicated relationship between the values, hwoever about 43% of the dataset’s variance can be described by a relationship between heliocentric velocity and the signal to noise ratio of the data. This makes complete sense because with a higher velocity you are more likely to encounter noise from moving objects, noise from tracking systems, and noise from space dust particles and atmospheric effects.

In the future this data can be used to determine the larger issues with the Sloan Digital Sky Survey.

References

Funding for the Sloan Digital Sky Survey V has been provided by the Alfred P. Sloan Foundation, the Heising-Simons Foundation, the National Science Foundation, and the Participating Institutions. SDSS acknowledges support and resources from the Center for High-Performance Computing at the University of Utah. SDSS telescopes are located at Apache Point Observatory, funded by the Astrophysical Research Consortium and operated by New Mexico State University, and at Las Campanas Observatory, operated by the Carnegie Institution for Science. The SDSS web site is www.sdss.org.

SDSS is managed by the Astrophysical Research Consortium for the Participating Institutions of the SDSS Collaboration, including the Carnegie Institution for Science, Chilean National Time Allocation Committee (CNTAC) ratified researchers, Caltech, the Gotham Participation Group, Harvard University, Heidelberg University, The Flatiron Institute, The Johns Hopkins University, L’Ecole polytechnique fédérale de Lausanne (EPFL), Leibniz-Institut für Astrophysik Potsdam (AIP), Max-Planck-Institut für Astronomie (MPIA Heidelberg), Max-Planck-Institut für Extraterrestrische Physik (MPE), Nanjing University, National Astronomical Observatories of China (NAOC), New Mexico State University, The Ohio State University, Pennsylvania State University, Smithsonian Astrophysical Observatory, Space Telescope Science Institute (STScI), the Stellar Astrophysics Participation Group, Universidad Nacional Autónoma de México (UNAM), University of Arizona, University of Colorado Boulder, University of Illinois at Urbana-Champaign, University of Toronto, University of Utah, University of Virginia, Yale University, and Yunnan University.

A Principle Component Analysis of Star Spectroscopy in R

Rei Johnson

2026-05-07

Introduction

Data Analysis

Statistical Analysis

Conclusion and Future Direction

References