For this project, I am using a refined version of the MaStar stellar catalogue. Even though the catalogue is refined, there are still tens of variables that we should attempt to relate to each other. Therefore I ask, What are the strongest relations between variables in the stellar catalog?
To answer that question I will pass 24,290 stars through a Principle COmponent Analysis in an attempt to describe a relationship between variables in the stellar catalog.
Data analysis begins with extracting our dataset from the csv file.
mastar = read_csv("Data/mastarall.csv", show_col_types=FALSE)
summary(mastar) # Summarize the data
## DRPVER MPROCVER MANGAID PLATE
## Length:59266 Length:59266 Length:59266 Min. : 7443
## Class :character Class :character Class :character 1st Qu.: 8921
## Mode :character Mode :character Mode :character Median : 9800
## Mean :10036
## 3rd Qu.:11254
## Max. :12772
##
## IFUDESIGN MJD IFURA IFUDEC
## Min. : 701 Min. :56739 Min. : 0.0528 Min. :-32.666
## 1st Qu.: 706 1st Qu.:57673 1st Qu.:115.2279 1st Qu.: 7.576
## Median : 711 Median :58119 Median :194.3851 Median : 28.365
## Mean : 3599 Mean :58101 Mean :186.6236 Mean : 27.857
## 3rd Qu.: 6102 3rd Qu.:58582 3rd Qu.:261.0807 3rd Qu.: 47.416
## Max. :12705 Max. :59085 Max. :359.9651 Max. : 87.357
##
## PSFMAG MNGTARG2 EXPTIME NEXP_VISIT
## Length:59266 Min. : 132 Min. : 10.1 Min. : 1.000
## Class :character 1st Qu.: 1280 1st Qu.:900.1 1st Qu.: 3.000
## Mode :character Median : 8390656 Median :900.1 Median : 4.000
## Mean : 27139706 Mean :777.9 Mean : 6.727
## 3rd Qu.: 67108992 3rd Qu.:900.1 3rd Qu.: 6.000
## Max. :134299648 Max. :900.2 Max. :62.000
##
## NVELGOOD HELIOV VERR V_ERRCODE
## Min. : 0 Min. :-512.50 Min. :0.003199 Min. :0
## 1st Qu.: 9 1st Qu.: -47.90 1st Qu.:0.677605 1st Qu.:0
## Median : 12 Median : -10.03 Median :1.124152 Median :0
## Mean : 16 Mean : -17.12 Mean :1.477142 Mean :0
## 3rd Qu.: 17 3rd Qu.: 21.87 3rd Qu.:1.860484 3rd Qu.:0
## Max. :135 Max. : 488.71 Max. :9.987876 Max. :0
##
## HELIOV_VISIT VERR_VISIT V_ERRCODE_VISIT VELVARFLAG
## Min. :-520.317 Min. : 0.0000 Min. :0.0000000 Min. :0.0000
## 1st Qu.: -47.765 1st Qu.: 0.6357 1st Qu.:0.0000000 1st Qu.:0.0000
## Median : -9.952 Median : 1.3248 Median :0.0000000 Median :0.0000
## Mean : -17.002 Mean : 1.9169 Mean :0.0005568 Mean :0.2874
## 3rd Qu.: 22.202 3rd Qu.: 2.4725 3rd Qu.:0.0000000 3rd Qu.:1.0000
## Max. : 489.544 Max. :507.9113 Max. :1.0000000 Max. :1.0000
##
## DV_MAXSIG MJDQUAL BPRPERR_SP NEXP_USED
## Min. : 0.0000 Min. : 0.0 Min. :-999.00000 Min. : 1.000
## 1st Qu.: 0.4372 1st Qu.: 0.0 1st Qu.: 0.00430 1st Qu.: 3.000
## Median : 1.5599 Median : 0.0 Median : 0.00813 Median : 4.000
## Mean : 3.1547 Mean : 624.9 Mean : -41.07281 Mean : 6.693
## 3rd Qu.: 3.3920 3rd Qu.:1024.0 3rd Qu.: 0.01516 3rd Qu.: 6.000
## Max. :169.6528 Max. :3328.0 Max. : 0.05000 Max. :62.000
## NA's :8
## S2N S2N10 BADPIXFRAC RA
## Min. : 2.814 Min. : 34.08 Min. :0.000000 Min. : 0.0528
## 1st Qu.: 63.146 1st Qu.: 83.80 1st Qu.:0.000000 1st Qu.:115.2279
## Median : 95.904 Median : 126.89 Median :0.001096 Median :194.3851
## Mean : 126.010 Mean : 165.48 Mean :0.001126 Mean :186.6236
## 3rd Qu.: 159.397 3rd Qu.: 208.42 3rd Qu.:0.001534 3rd Qu.:261.0808
## Max. :1024.016 Max. :1386.06 Max. :0.019724 Max. :359.9651
##
## DEC EPOCH COORD_SOURCE PHOTOCAT
## Min. :-32.666 Min. :1999 Length:59266 Length:59266
## 1st Qu.: 7.576 1st Qu.:2012 Class :character Class :character
## Median : 28.365 Median :2012 Mode :character Mode :character
## Mean : 27.857 Mean :2010
## 3rd Qu.: 47.416 3rd Qu.:2012
## Max. : 87.357 Max. :2016
##
head(mastar) # Get the first lines of the data
## # A tibble: 6 × 32
## DRPVER MPROCVER MANGAID PLATE IFUDESIGN MJD IFURA IFUDEC PSFMAG MNGTARG2
## <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <dbl>
## 1 v3_1_1 v1_7_7 5-66031 10001 701 57372 134. 56.4 [17.9317 … 8390656
## 2 v3_1_1 v1_7_7 5-66031 10001 701 57373 134. 56.4 [17.9317 … 8390656
## 3 v3_1_1 v1_7_7 5-12626 10001 702 57372 136. 57.6 [17.3449 … 8390656
## 4 v3_1_1 v1_7_7 5-12626 10001 702 57373 136. 57.6 [17.3449 … 8390656
## 5 v3_1_1 v1_7_7 5-66039 10001 703 57372 134. 57.9 [17.4606 … 8390656
## 6 v3_1_1 v1_7_7 5-66039 10001 703 57373 134. 57.9 [17.4606 … 8390656
## # ℹ 22 more variables: EXPTIME <dbl>, NEXP_VISIT <dbl>, NVELGOOD <dbl>,
## # HELIOV <dbl>, VERR <dbl>, V_ERRCODE <dbl>, HELIOV_VISIT <dbl>,
## # VERR_VISIT <dbl>, V_ERRCODE_VISIT <dbl>, VELVARFLAG <dbl>, DV_MAXSIG <dbl>,
## # MJDQUAL <dbl>, BPRPERR_SP <dbl>, NEXP_USED <dbl>, S2N <dbl>, S2N10 <dbl>,
## # BADPIXFRAC <dbl>, RA <dbl>, DEC <dbl>, EPOCH <dbl>, COORD_SOURCE <chr>,
## # PHOTOCAT <chr>
We can see that there are a total of 32 columns in this dataset and a total of 59,266 observations. This data will later be passed into the PCA algorithm however we need to clean it first. To clean the data I will maintain the ID of each star but remove all other identification and character based columns.
mastar_refined <- mastar |>
select(c(MANGAID, where(is.numeric)))
mastar_refined
## # A tibble: 59,266 × 27
## MANGAID PLATE IFUDESIGN MJD IFURA IFUDEC MNGTARG2 EXPTIME NEXP_VISIT
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 5-66031 10001 701 57372 134. 56.4 8390656 900. 3
## 2 5-66031 10001 701 57373 134. 56.4 8390656 900. 6
## 3 5-12626 10001 702 57372 136. 57.6 8390656 900. 3
## 4 5-12626 10001 702 57373 136. 57.6 8390656 900. 6
## 5 5-66039 10001 703 57372 134. 57.9 8390656 900. 3
## 6 5-66039 10001 703 57373 134. 57.9 8390656 900. 6
## 7 5-108715 10001 704 57372 133. 58.0 8390656 900. 3
## 8 5-108715 10001 704 57373 133. 58.0 8390656 900. 6
## 9 5-108718 10001 705 57372 133. 58.3 8390656 900. 3
## 10 5-108718 10001 705 57373 133. 58.3 8390656 900. 6
## # ℹ 59,256 more rows
## # ℹ 18 more variables: NVELGOOD <dbl>, HELIOV <dbl>, VERR <dbl>,
## # V_ERRCODE <dbl>, HELIOV_VISIT <dbl>, VERR_VISIT <dbl>,
## # V_ERRCODE_VISIT <dbl>, VELVARFLAG <dbl>, DV_MAXSIG <dbl>, MJDQUAL <dbl>,
## # BPRPERR_SP <dbl>, NEXP_USED <dbl>, S2N <dbl>, S2N10 <dbl>,
## # BADPIXFRAC <dbl>, RA <dbl>, DEC <dbl>, EPOCH <dbl>
These are all the variables we need for the analysis. So we can go into some further digging around the data. Starting with investigating some of the columns. Specifically investigating the epxosure columns and deciding to omit them from the PCA.
head(mastar_refined$EXPTIME)
## [1] 900.080 900.095 900.080 900.095 900.080 900.095
head(mastar_refined$NEXP_VISIT)
## [1] 3 6 3 6 3 6
From this we can see the completely linear relationship, this means the number of exposures and the exposure time are constatnt or relatively constant over the survey. In fact, NEXP_VISIT will always be 3 or 6, whereas the EXPTIME will always be 900.080 or 900.095. We should drop this from PCA since there will be a 1:1 relationship and that provides no valuable data.
Finally impute all 8 NAs with the mean for the variable:
# Fixing NA Values
sum(is.na(mastar_refined))
## [1] 8
mastar_refined <- mastar_refined |>
mutate(across(where(is.numeric),
~ ifelse(is.na(.x), median(.x, na.rm = TRUE), .x)))
sum(is.na(mastar_refined))
## [1] 0
We have established a basis for our statistical analysis so let’s pursue PCA now.
# Standardizing
scaled_mastar <- mastar_refined |>
select(c(MJD, MNGTARG2, HELIOV, VERR_VISIT, HELIOV_VISIT, DV_MAXSIG, BPRPERR_SP, S2N,, S2N10, BADPIXFRAC)) |>
na.omit(scaled_mastar) |>
scale()
# PCA
mastar_pca <- prcomp(scaled_mastar)
mastar_pca
## Standard deviations (1, .., p=10):
## [1] 1.51711520 1.40641821 1.24509306 1.03696799 0.96968571 0.95350936
## [7] 0.88025014 0.63148131 0.26477941 0.04002956
##
## Rotation (n x k) = (10 x 10):
## PC1 PC2 PC3 PC4 PC5
## MJD 0.2293771 0.10998527 -0.06602767 0.63794554 -0.192172053
## MNGTARG2 0.2105345 0.01922726 -0.09015483 0.68248226 0.308291218
## HELIOV 0.2760963 -0.63036810 0.15159638 -0.01543294 0.005544804
## VERR_VISIT -0.2056607 -0.05229600 0.21935119 0.16437984 -0.457768992
## HELIOV_VISIT 0.2753590 -0.63054239 0.15287451 -0.01364212 0.001927515
## DV_MAXSIG 0.1836396 0.05635443 -0.19484361 -0.15622974 0.698364613
## BPRPERR_SP 0.1720694 0.18756744 0.64529285 -0.01290407 0.131572105
## S2N 0.5633414 0.21478127 -0.13608558 -0.17276652 -0.228519886
## S2N10 0.5622895 0.20784001 -0.12890473 -0.21300968 -0.277613523
## BADPIXFRAC -0.1088982 -0.24932953 -0.63525021 0.00167355 -0.155320828
## PC6 PC7 PC8 PC9 PC10
## MJD 0.1108268285 -0.68529333 0.03734641 -0.060998071 0.0002589282
## MNGTARG2 -0.1201498314 0.60139605 -0.03211092 0.095956348 0.0004478476
## HELIOV 0.0009943688 -0.04026698 -0.03822525 -0.007559279 0.7071179441
## VERR_VISIT 0.7597434873 0.30378035 -0.02535220 0.008257156 0.0087805167
## HELIOV_VISIT 0.0127261767 -0.03673493 -0.03964846 -0.007144394 -0.7070352273
## DV_MAXSIG 0.6247945852 -0.15005349 0.01061446 -0.003517369 0.0025338098
## BPRPERR_SP -0.0257699356 0.01185455 0.70741251 -0.009607736 -0.0007814327
## S2N 0.0307115919 0.20444986 -0.04231485 -0.700994125 0.0005249470
## S2N10 0.0510538352 0.05827503 -0.01610205 0.703783928 0.0007968765
## BADPIXFRAC 0.0364178802 0.07476738 0.70093030 -0.009272971 -0.0003065024
We have gotten 10 PCs as a response from the program. Here are the variance explained percentages for the components:
library(broom)
mastar_pca |>
tidy(matrix="eigenvalues")
## # A tibble: 10 × 4
## PC std.dev percent cumulative
## <dbl> <dbl> <dbl> <dbl>
## 1 1 1.52 0.230 0.230
## 2 2 1.41 0.198 0.428
## 3 3 1.25 0.155 0.583
## 4 4 1.04 0.108 0.691
## 5 5 0.970 0.0940 0.785
## 6 6 0.954 0.0909 0.875
## 7 7 0.880 0.0775 0.953
## 8 8 0.631 0.0399 0.993
## 9 9 0.265 0.00701 1.000
## 10 10 0.0400 0.00016 1
The first two components account for 42% of the variance in the dataset. Next let’s graph these two components to see a relationship between the two and better understand the data. We know the following about PC1 and PC2:
PC1 is a relationship between the signal to noise ratio (0.56) and the -1 sigma uncertainty in the median heliocentric velocity (-0.2)
PC2 is a relationship between signal to noise ratio (0.2) to the heliocentric velocity (-0.63)
Graphing:
mastar_pca |>
# Add PCs to the original dataset
augment(mastar_refined) |>
ggplot(aes(.fittedPC1, .fittedPC2)) +
geom_point(aes()) +
# Add the PCA1 and PCA2 arrows
geom_segment(aes(x = -4.9, y = 0, xend = 5, yend = 0),
arrow = arrow(type = "closed", length = unit(0.1, "inches")),
color = "black") + # PC1 arrow
geom_segment(aes(x = 0, y = -2.5, xend = 0, yend = 4),
arrow = arrow(type = "closed", length = unit(0.1, "inches")),
color = "black") + # PC2 arrow
# Add text labels
geom_text(aes(x = 5, y = 0, label = "PC1"),
vjust = -0.5, color = "black") +
geom_text(aes(x = 0, y = 4, label = "PC2"),
hjust = -0.5, color = "black") +
# Customize the plot
xlab("PC1") +
ylab("PC2") +
theme_minimal()
## Warning in geom_segment(aes(x = -4.9, y = 0, xend = 5, yend = 0), arrow = arrow(type = "closed", : All aesthetics have length 1, but the data has 59266 rows.
## ℹ Please consider using `annotate()` or provide this layer with data containing
## a single row.
## Warning in geom_segment(aes(x = 0, y = -2.5, xend = 0, yend = 4), arrow = arrow(type = "closed", : All aesthetics have length 1, but the data has 59266 rows.
## ℹ Please consider using `annotate()` or provide this layer with data containing
## a single row.
## Warning in geom_text(aes(x = 5, y = 0, label = "PC1"), vjust = -0.5, color = "black"): All aesthetics have length 1, but the data has 59266 rows.
## ℹ Please consider using `annotate()` or provide this layer with data containing
## a single row.
## Warning in geom_text(aes(x = 0, y = 4, label = "PC2"), hjust = -0.5, color = "black"): All aesthetics have length 1, but the data has 59266 rows.
## ℹ Please consider using `annotate()` or provide this layer with data containing
## a single row.
The returned graph is borderline unreadable with the quantity of data,
so let’s plot the rotation matrix:
#Rotation Matrix
arrow_style <- arrow(
angle = 20, length = grid::unit(8, "pt"),
ends = "first", type = "closed"
)
mastar_pca |>
# extract rotation matrix
tidy(matrix = "rotation") |>
pivot_wider(
names_from = "PC", values_from = "value",
names_prefix = "PC"
) |>
ggplot(aes(PC1, PC2)) +
geom_segment(
xend = 0, yend = 0,
arrow = arrow_style
) +
geom_text(aes(label = column), hjust = 1) +
xlim(-1, 0.5) + ylim(-1, 0.7) +
coord_fixed()+
theme_minimal()
## Warning: Removed 2 rows containing missing values or values outside the scale range
## (`geom_segment()`).
## Warning: Removed 2 rows containing missing values or values outside the scale range
## (`geom_text()`).
From a principle component analysis of the data we know that there is a very complicated relationship between the values, hwoever about 43% of the dataset’s variance can be described by a relationship between heliocentric velocity and the signal to noise ratio of the data. This makes complete sense because with a higher velocity you are more likely to encounter noise from moving objects, noise from tracking systems, and noise from space dust particles and atmospheric effects.
In the future this data can be used to determine the larger issues with the Sloan Digital Sky Survey.
Funding for the Sloan Digital Sky Survey V has been provided by the Alfred P. Sloan Foundation, the Heising-Simons Foundation, the National Science Foundation, and the Participating Institutions. SDSS acknowledges support and resources from the Center for High-Performance Computing at the University of Utah. SDSS telescopes are located at Apache Point Observatory, funded by the Astrophysical Research Consortium and operated by New Mexico State University, and at Las Campanas Observatory, operated by the Carnegie Institution for Science. The SDSS web site is www.sdss.org.
SDSS is managed by the Astrophysical Research Consortium for the Participating Institutions of the SDSS Collaboration, including the Carnegie Institution for Science, Chilean National Time Allocation Committee (CNTAC) ratified researchers, Caltech, the Gotham Participation Group, Harvard University, Heidelberg University, The Flatiron Institute, The Johns Hopkins University, L’Ecole polytechnique fédérale de Lausanne (EPFL), Leibniz-Institut für Astrophysik Potsdam (AIP), Max-Planck-Institut für Astronomie (MPIA Heidelberg), Max-Planck-Institut für Extraterrestrische Physik (MPE), Nanjing University, National Astronomical Observatories of China (NAOC), New Mexico State University, The Ohio State University, Pennsylvania State University, Smithsonian Astrophysical Observatory, Space Telescope Science Institute (STScI), the Stellar Astrophysics Participation Group, Universidad Nacional Autónoma de México (UNAM), University of Arizona, University of Colorado Boulder, University of Illinois at Urbana-Champaign, University of Toronto, University of Utah, University of Virginia, Yale University, and Yunnan University.