Introduction

The Sloan Digital Sky Survey (SDSS) is a wide-field telescope survey that releases its raw data and images to the public. One of its component surveys is MaStar, the MaNGA Stellar Library. Packaged with the MaStar data are several error variables, which raises the question: is it possible to characterize errors within the MaStar data based on the surrounding information? All MaStar/MaNGA/SDSS data are publicly available at www.sdss.org.

Data Analysis

To begin the analysis, the dataset needs to be understood. Here I retrieve a summary of the dataset and its first rows.

library(tidyverse) # provides read_csv(), select(), group_by(), and summarize()
mastar <- read_csv("Data/mastarall.csv", show_col_types = FALSE)
summary(mastar) # Summarize the data

##     DRPVER            MPROCVER           MANGAID              PLATE      
##  Length:59266       Length:59266       Length:59266       Min.   : 7443  
##  Class :character   Class :character   Class :character   1st Qu.: 8921  
##  Mode  :character   Mode  :character   Mode  :character   Median : 9800  
##                                                           Mean   :10036  
##                                                           3rd Qu.:11254  
##                                                           Max.   :12772  
##                                                                          
##    IFUDESIGN          MJD            IFURA              IFUDEC       
##  Min.   :  701   Min.   :56739   Min.   :  0.0528   Min.   :-32.666  
##  1st Qu.:  706   1st Qu.:57673   1st Qu.:115.2279   1st Qu.:  7.576  
##  Median :  711   Median :58119   Median :194.3851   Median : 28.365  
##  Mean   : 3599   Mean   :58101   Mean   :186.6236   Mean   : 27.857  
##  3rd Qu.: 6102   3rd Qu.:58582   3rd Qu.:261.0807   3rd Qu.: 47.416  
##  Max.   :12705   Max.   :59085   Max.   :359.9651   Max.   : 87.357  
##                                                                      
##     PSFMAG             MNGTARG2            EXPTIME        NEXP_VISIT    
##  Length:59266       Min.   :      132   Min.   : 10.1   Min.   : 1.000  
##  Class :character   1st Qu.:     1280   1st Qu.:900.1   1st Qu.: 3.000  
##  Mode  :character   Median :  8390656   Median :900.1   Median : 4.000  
##                     Mean   : 27139706   Mean   :777.9   Mean   : 6.727  
##                     3rd Qu.: 67108992   3rd Qu.:900.1   3rd Qu.: 6.000  
##                     Max.   :134299648   Max.   :900.2   Max.   :62.000  
##                                                                         
##     NVELGOOD       HELIOV             VERR            V_ERRCODE
##  Min.   :  0   Min.   :-512.50   Min.   :0.003199   Min.   :0  
##  1st Qu.:  9   1st Qu.: -47.90   1st Qu.:0.677605   1st Qu.:0  
##  Median : 12   Median : -10.03   Median :1.124152   Median :0  
##  Mean   : 16   Mean   : -17.12   Mean   :1.477142   Mean   :0  
##  3rd Qu.: 17   3rd Qu.:  21.87   3rd Qu.:1.860484   3rd Qu.:0  
##  Max.   :135   Max.   : 488.71   Max.   :9.987876   Max.   :0  
##                                                                
##   HELIOV_VISIT        VERR_VISIT       V_ERRCODE_VISIT       VELVARFLAG    
##  Min.   :-520.317   Min.   :  0.0000   Min.   :0.0000000   Min.   :0.0000  
##  1st Qu.: -47.765   1st Qu.:  0.6357   1st Qu.:0.0000000   1st Qu.:0.0000  
##  Median :  -9.952   Median :  1.3248   Median :0.0000000   Median :0.0000  
##  Mean   : -17.002   Mean   :  1.9169   Mean   :0.0005568   Mean   :0.2874  
##  3rd Qu.:  22.202   3rd Qu.:  2.4725   3rd Qu.:0.0000000   3rd Qu.:1.0000  
##  Max.   : 489.544   Max.   :507.9113   Max.   :1.0000000   Max.   :1.0000  
##                                                                            
##    DV_MAXSIG           MJDQUAL         BPRPERR_SP           NEXP_USED     
##  Min.   :  0.0000   Min.   :   0.0   Min.   :-999.00000   Min.   : 1.000  
##  1st Qu.:  0.4372   1st Qu.:   0.0   1st Qu.:   0.00430   1st Qu.: 3.000  
##  Median :  1.5599   Median :   0.0   Median :   0.00813   Median : 4.000  
##  Mean   :  3.1547   Mean   : 624.9   Mean   : -41.07281   Mean   : 6.693  
##  3rd Qu.:  3.3920   3rd Qu.:1024.0   3rd Qu.:   0.01516   3rd Qu.: 6.000  
##  Max.   :169.6528   Max.   :3328.0   Max.   :   0.05000   Max.   :62.000  
##                                      NA's   :8                            
##       S2N               S2N10           BADPIXFRAC             RA          
##  Min.   :   2.814   Min.   :  34.08   Min.   :0.000000   Min.   :  0.0528  
##  1st Qu.:  63.146   1st Qu.:  83.80   1st Qu.:0.000000   1st Qu.:115.2279  
##  Median :  95.904   Median : 126.89   Median :0.001096   Median :194.3851  
##  Mean   : 126.010   Mean   : 165.48   Mean   :0.001126   Mean   :186.6236  
##  3rd Qu.: 159.397   3rd Qu.: 208.42   3rd Qu.:0.001534   3rd Qu.:261.0808  
##  Max.   :1024.016   Max.   :1386.06   Max.   :0.019724   Max.   :359.9651  
##                                                                            
##       DEC              EPOCH      COORD_SOURCE         PHOTOCAT        
##  Min.   :-32.666   Min.   :1999   Length:59266       Length:59266      
##  1st Qu.:  7.576   1st Qu.:2012   Class :character   Class :character  
##  Median : 28.365   Median :2012   Mode  :character   Mode  :character  
##  Mean   : 27.857   Mean   :2010                                        
##  3rd Qu.: 47.416   3rd Qu.:2012                                        
##  Max.   : 87.357   Max.   :2016                                        
##

head(mastar) # Get the first lines of the data

## # A tibble: 6 × 32
##   DRPVER MPROCVER MANGAID PLATE IFUDESIGN   MJD IFURA IFUDEC PSFMAG     MNGTARG2
##   <chr>  <chr>    <chr>   <dbl>     <dbl> <dbl> <dbl>  <dbl> <chr>         <dbl>
## 1 v3_1_1 v1_7_7   5-66031 10001       701 57372  134.   56.4 [17.9317 …  8390656
## 2 v3_1_1 v1_7_7   5-66031 10001       701 57373  134.   56.4 [17.9317 …  8390656
## 3 v3_1_1 v1_7_7   5-12626 10001       702 57372  136.   57.6 [17.3449 …  8390656
## 4 v3_1_1 v1_7_7   5-12626 10001       702 57373  136.   57.6 [17.3449 …  8390656
## 5 v3_1_1 v1_7_7   5-66039 10001       703 57372  134.   57.9 [17.4606 …  8390656
## 6 v3_1_1 v1_7_7   5-66039 10001       703 57373  134.   57.9 [17.4606 …  8390656
## # ℹ 22 more variables: EXPTIME <dbl>, NEXP_VISIT <dbl>, NVELGOOD <dbl>,
## #   HELIOV <dbl>, VERR <dbl>, V_ERRCODE <dbl>, HELIOV_VISIT <dbl>,
## #   VERR_VISIT <dbl>, V_ERRCODE_VISIT <dbl>, VELVARFLAG <dbl>, DV_MAXSIG <dbl>,
## #   MJDQUAL <dbl>, BPRPERR_SP <dbl>, NEXP_USED <dbl>, S2N <dbl>, S2N10 <dbl>,
## #   BADPIXFRAC <dbl>, RA <dbl>, DEC <dbl>, EPOCH <dbl>, COORD_SOURCE <chr>,
## #   PHOTOCAT <chr>

From the summary and the first rows, we can see that the dataset has 32 columns: six are character columns and the remaining 26 are numeric (all stored as doubles, although several, such as PLATE and NEXP_VISIT, hold integer values). For this project we keep the MaNGA ID column and a subset of the numeric columns. That is done below.

mastar_refined <- mastar |> 
  select(c(MANGAID, MJD, IFURA, IFUDEC, MNGTARG2, EXPTIME, NEXP_VISIT, NVELGOOD, HELIOV, VERR, HELIOV_VISIT, VERR_VISIT, V_ERRCODE_VISIT, DV_MAXSIG))

mastar_refined

## # A tibble: 59,266 × 14
##    MANGAID    MJD IFURA IFUDEC MNGTARG2 EXPTIME NEXP_VISIT NVELGOOD HELIOV  VERR
##    <chr>    <dbl> <dbl>  <dbl>    <dbl>   <dbl>      <dbl>    <dbl>  <dbl> <dbl>
##  1 5-66031  57372  134.   56.4  8390656    900.          3        9  -81.5  4.41
##  2 5-66031  57373  134.   56.4  8390656    900.          6        9  -81.5  4.41
##  3 5-12626  57372  136.   57.6  8390656    900.          3        9 -226.   2.16
##  4 5-12626  57373  136.   57.6  8390656    900.          6        9 -226.   2.16
##  5 5-66039  57372  134.   57.9  8390656    900.          3        9 -257.   1.60
##  6 5-66039  57373  134.   57.9  8390656    900.          6        9 -257.   1.60
##  7 5-108715 57372  133.   58.0  8390656    900.          3        9  -29.8  2.52
##  8 5-108715 57373  133.   58.0  8390656    900.          6        9  -29.8  2.52
##  9 5-108718 57372  133.   58.3  8390656    900.          3        9   14.6  1.48
## 10 5-108718 57373  133.   58.3  8390656    900.          6        9   14.6  1.48
## # ℹ 59,256 more rows
## # ℹ 4 more variables: HELIOV_VISIT <dbl>, VERR_VISIT <dbl>,
## #   V_ERRCODE_VISIT <dbl>, DV_MAXSIG <dbl>

Now we have the columns we need: timing data, error data, and astrophysical measurements of the stars. Each of these can be used in a regression model to look for links between the predictors and the error variables. Our error variables are VERR, VERR_VISIT, and V_ERRCODE_VISIT. We can still explore this data a bit further, though.

For example, we can group the data by star, which reduces the 59,266 visit-level rows to 24,290 per-star rows (a bit less than half).

mastar_stars <- mastar_refined |> 
  group_by(MANGAID) |> 
  # An anonymous function replaces the `...` form deprecated in dplyr 1.1.0
  summarize(across(everything(), \(x) mean(x, na.rm = FALSE)))

mastar_stars

## # A tibble: 24,290 × 14
##    MANGAID    MJD IFURA IFUDEC MNGTARG2 EXPTIME NEXP_VISIT NVELGOOD HELIOV  VERR
##    <chr>    <dbl> <dbl>  <dbl>    <dbl>   <dbl>      <dbl>    <dbl>  <dbl> <dbl>
##  1 13-0    56743.  231.   41.9  1050624    900.          5       15  -60.1 0.850
##  2 13-1    56743.  231.   41.5  1050624    900.          5       15  -65.6 1.00 
##  3 13-10   56743.  230.   42.3  1050624    900.          5       15 -197.  1.13 
##  4 13-11   56743.  232.   43.6  1050624    900.          5       15 -185.  0.853
##  5 13-2    56743   231.   42.4  1050624    900.          6       15  -16.3 1.11 
##  6 13-3    56743.  233.   42.6  1050624    900.          5       15 -211.  1.63 
##  7 13-4    56743.  231.   43.0  1050624    900.          5       15  -23.1 1.06 
##  8 13-5    56743.  233.   43.3  1050624    900.          5       15 -145.  1.80 
##  9 13-6    56743.  231.   43.9  1050624    900.          5       15  -95.4 1.44 
## 10 13-7    56743.  229.   42.5  1050624    900.          5       15   25.3 0.746
## # ℹ 24,280 more rows
## # ℹ 4 more variables: HELIOV_VISIT <dbl>, VERR_VISIT <dbl>,
## #   V_ERRCODE_VISIT <dbl>, DV_MAXSIG <dbl>

Now we have one row per star, with each column averaged over that star’s visits. There is no need to treat the star ID as a factor in the models, since that would require 24,290 levels.
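One consequence of averaging is worth keeping in mind for a 0/1 column such as V_ERRCODE_VISIT: its per-star mean is the fraction of visits that were flagged, not a flag. A minimal sketch on a made-up data frame follows. The IDs and values are hypothetical, and base R’s aggregate() stands in for the dplyr pipeline.

```r
# Hypothetical visit-level data: three stars, six visits.
toy <- data.frame(
  MANGAID         = c("a", "a", "b", "b", "b", "c"),
  V_ERRCODE_VISIT = c(0, 1, 0, 0, 0, 0),
  VERR_VISIT      = c(1.0, 3.0, 0.5, 0.7, 0.6, 2.0)
)

# Per-star mean, analogous to the group_by() + summarize() step.
per_star <- aggregate(cbind(V_ERRCODE_VISIT, VERR_VISIT) ~ MANGAID,
                      data = toy, FUN = mean)

# Per-star maximum of the flag: 1 if any visit was flagged.
any_error <- aggregate(V_ERRCODE_VISIT ~ MANGAID, data = toy, FUN = max)

print(per_star)
print(any_error)
```

Star "a" ends up with a mean flag of 0.5, so a test such as `!= 0` on the averaged column counts stars with at least one flagged visit.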

Regression Analysis

Regression analysis of this dataset uses three models, one for each of the error variables. Let’s start by setting up each of the linear regression models.

verr <- lm(VERR ~ MJD + IFURA + IFUDEC + MNGTARG2 + EXPTIME + NEXP_VISIT + NVELGOOD + HELIOV + HELIOV_VISIT, data = mastar_stars)
verr_visit <- lm(VERR_VISIT ~ MJD + IFURA + IFUDEC + MNGTARG2 + EXPTIME + NEXP_VISIT + NVELGOOD + HELIOV + HELIOV_VISIT, data = mastar_stars)
verrcode <- lm(V_ERRCODE_VISIT ~ MJD + IFURA + IFUDEC + MNGTARG2 + EXPTIME + NEXP_VISIT + NVELGOOD + HELIOV + HELIOV_VISIT, data = mastar_stars)

summary(verr)

## 
## Call:
## lm(formula = VERR ~ MJD + IFURA + IFUDEC + MNGTARG2 + EXPTIME + 
##     NEXP_VISIT + NVELGOOD + HELIOV + HELIOV_VISIT, data = mastar_stars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.2214 -0.7541 -0.3047  0.3750  9.7019 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -9.853e+00  9.899e-01  -9.954  < 2e-16 ***
## MJD           1.869e-04  1.682e-05  11.112  < 2e-16 ***
## IFURA        -5.489e-04  8.499e-05  -6.458 1.08e-10 ***
## IFUDEC       -2.523e-03  3.160e-04  -7.985 1.46e-15 ***
## MNGTARG2     -2.187e-09  1.947e-10 -11.234  < 2e-16 ***
## EXPTIME       1.038e-03  4.858e-05  21.358  < 2e-16 ***
## NEXP_VISIT    9.719e-03  1.947e-03   4.991 6.05e-07 ***
## NVELGOOD     -7.419e-03  8.468e-04  -8.760  < 2e-16 ***
## HELIOV       -1.409e-02  2.609e-03  -5.401 6.67e-08 ***
## HELIOV_VISIT  1.327e-02  2.611e-03   5.081 3.79e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.231 on 24280 degrees of freedom
## Multiple R-squared:  0.08324,    Adjusted R-squared:  0.0829 
## F-statistic: 244.9 on 9 and 24280 DF,  p-value: < 2.2e-16

summary(verr_visit)

## 
## Call:
## lm(formula = VERR_VISIT ~ MJD + IFURA + IFUDEC + MNGTARG2 + EXPTIME + 
##     NEXP_VISIT + NVELGOOD + HELIOV + HELIOV_VISIT, data = mastar_stars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -31.209  -0.900  -0.277   0.484 237.987 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -1.239e+01  1.736e+00  -7.139 9.68e-13 ***
## MJD           2.291e-04  2.950e-05   7.768 8.31e-15 ***
## IFURA        -6.734e-04  1.490e-04  -4.519 6.24e-06 ***
## IFUDEC       -3.531e-03  5.540e-04  -6.373 1.89e-10 ***
## MNGTARG2     -3.004e-09  3.413e-10  -8.799  < 2e-16 ***
## EXPTIME       1.637e-03  8.517e-05  19.215  < 2e-16 ***
## NEXP_VISIT    1.474e-03  3.414e-03   0.432    0.666    
## NVELGOOD     -1.528e-03  1.485e-03  -1.029    0.303    
## HELIOV       -1.819e-01  4.574e-03 -39.755  < 2e-16 ***
## HELIOV_VISIT  1.809e-01  4.579e-03  39.502  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.159 on 24280 degrees of freedom
## Multiple R-squared:  0.1162, Adjusted R-squared:  0.1159 
## F-statistic: 354.8 on 9 and 24280 DF,  p-value: < 2.2e-16

summary(verrcode)

## 
## Call:
## lm(formula = V_ERRCODE_VISIT ~ MJD + IFURA + IFUDEC + MNGTARG2 + 
##     EXPTIME + NEXP_VISIT + NVELGOOD + HELIOV + HELIOV_VISIT, 
##     data = mastar_stars)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.02681 -0.00130 -0.00061  0.00007  0.99917 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -1.913e-02  1.593e-02  -1.201  0.22988    
## MJD           4.075e-07  2.708e-07   1.505  0.13231    
## IFURA        -6.475e-06  1.368e-06  -4.734 2.21e-06 ***
## IFUDEC        9.355e-06  5.085e-06   1.840  0.06581 .  
## MNGTARG2     -1.350e-11  3.133e-12  -4.309 1.65e-05 ***
## EXPTIME      -2.406e-06  7.817e-07  -3.078  0.00208 ** 
## NEXP_VISIT    4.386e-05  3.134e-05   1.400  0.16162    
## NVELGOOD     -8.253e-05  1.363e-05  -6.056 1.42e-09 ***
## HELIOV       -1.223e-04  4.199e-05  -2.913  0.00358 ** 
## HELIOV_VISIT  1.234e-04  4.202e-05   2.936  0.00333 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.01982 on 24280 degrees of freedom
## Multiple R-squared:  0.003767,   Adjusted R-squared:  0.003398 
## F-statistic:  10.2 on 9 and 24280 DF,  p-value: 7.594e-16

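There is a lot of output here. With 24,290 stars, even small effects come out as statistically significant: in the VERR model every predictor is significant at the 0.001 level. In the VERR_VISIT model, NEXP_VISIT and NVELGOOD are not significant. In the V_ERRCODE_VISIT model, only IFURA, MNGTARG2, and NVELGOOD reach the 0.001 level, with EXPTIME, HELIOV, and HELIOV_VISIT significant at 0.01. For each of the models, let’s keep some of the more significant variables (the VERR model is unchanged):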

verr <- lm(VERR ~ MJD + IFURA + IFUDEC + MNGTARG2 + EXPTIME + NEXP_VISIT + NVELGOOD + HELIOV + HELIOV_VISIT, data = mastar_stars)
verr_visit <- lm(VERR_VISIT ~ MJD + IFURA + IFUDEC + MNGTARG2 + EXPTIME + HELIOV + HELIOV_VISIT, data = mastar_stars)
verrcode <- lm(V_ERRCODE_VISIT ~ IFURA + MNGTARG2 + EXPTIME + NVELGOOD, data = mastar_stars)

summary(verr)

## 
## Call:
## lm(formula = VERR ~ MJD + IFURA + IFUDEC + MNGTARG2 + EXPTIME + 
##     NEXP_VISIT + NVELGOOD + HELIOV + HELIOV_VISIT, data = mastar_stars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.2214 -0.7541 -0.3047  0.3750  9.7019 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -9.853e+00  9.899e-01  -9.954  < 2e-16 ***
## MJD           1.869e-04  1.682e-05  11.112  < 2e-16 ***
## IFURA        -5.489e-04  8.499e-05  -6.458 1.08e-10 ***
## IFUDEC       -2.523e-03  3.160e-04  -7.985 1.46e-15 ***
## MNGTARG2     -2.187e-09  1.947e-10 -11.234  < 2e-16 ***
## EXPTIME       1.038e-03  4.858e-05  21.358  < 2e-16 ***
## NEXP_VISIT    9.719e-03  1.947e-03   4.991 6.05e-07 ***
## NVELGOOD     -7.419e-03  8.468e-04  -8.760  < 2e-16 ***
## HELIOV       -1.409e-02  2.609e-03  -5.401 6.67e-08 ***
## HELIOV_VISIT  1.327e-02  2.611e-03   5.081 3.79e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.231 on 24280 degrees of freedom
## Multiple R-squared:  0.08324,    Adjusted R-squared:  0.0829 
## F-statistic: 244.9 on 9 and 24280 DF,  p-value: < 2.2e-16

summary(verr_visit)

## 
## Call:
## lm(formula = VERR_VISIT ~ MJD + IFURA + IFUDEC + MNGTARG2 + EXPTIME + 
##     HELIOV + HELIOV_VISIT, data = mastar_stars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -31.150  -0.900  -0.278   0.483 238.008 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -1.258e+01  1.726e+00  -7.286 3.28e-13 ***
## MJD           2.319e-04  2.936e-05   7.899 2.93e-15 ***
## IFURA        -6.591e-04  1.478e-04  -4.458 8.32e-06 ***
## IFUDEC       -3.555e-03  5.530e-04  -6.429 1.31e-10 ***
## MNGTARG2     -3.005e-09  3.408e-10  -8.819  < 2e-16 ***
## EXPTIME       1.651e-03  5.092e-05  32.422  < 2e-16 ***
## HELIOV       -1.818e-01  4.572e-03 -39.755  < 2e-16 ***
## HELIOV_VISIT  1.808e-01  4.576e-03  39.502  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.159 on 24282 degrees of freedom
## Multiple R-squared:  0.1162, Adjusted R-squared:  0.1159 
## F-statistic: 456.1 on 7 and 24282 DF,  p-value: < 2.2e-16

summary(verrcode)

## 
## Call:
## lm(formula = V_ERRCODE_VISIT ~ IFURA + MNGTARG2 + EXPTIME + NVELGOOD, 
##     data = mastar_stars)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.00560 -0.00120 -0.00057 -0.00003  0.99833 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  5.842e-03  6.351e-04   9.198  < 2e-16 ***
## IFURA       -6.492e-06  1.332e-06  -4.875 1.09e-06 ***
## MNGTARG2    -1.330e-11  3.113e-12  -4.273 1.94e-05 ***
## EXPTIME     -3.506e-06  5.428e-07  -6.459 1.07e-10 ***
## NVELGOOD    -7.324e-05  1.195e-05  -6.128 9.03e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.01982 on 24285 degrees of freedom
## Multiple R-squared:  0.00312,    Adjusted R-squared:  0.002956 
## F-statistic:    19 on 4 and 24285 DF,  p-value: 1.291e-15
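
The summaries above combine very small p-values with low R-squared values. That combination is typical of large samples, where weak effects are easy to detect but still explain little variance. A minimal sketch on simulated data follows; the slope of 0.05 and the noise level are arbitrary assumptions, not MaStar values.

```r
set.seed(42)
n <- 24290                      # same number of rows as mastar_stars
x <- rnorm(n)
y <- 0.05 * x + rnorm(n)        # weak true effect, unit-variance noise

fit <- lm(y ~ x)
s   <- summary(fit)

p_value   <- s$coefficients["x", "Pr(>|t|)"]
r_squared <- s$r.squared

cat("p-value for x:", format(p_value, digits = 3), "\n")
cat("R-squared:    ", format(r_squared, digits = 3), "\n")
```

The predictor is clearly detected, yet it accounts for well under 1% of the variance in y.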

It is important to note that the R-squared values, both multiple and adjusted, are too low to support strong conclusions about relationships within the data. Each of these models is fairly rudimentary, and even when refined they explain at most about 12% of the variation (R-squared of 0.1162 for VERR_VISIT, 0.0832 for VERR, and 0.0031 for V_ERRCODE_VISIT). But before jumping to this conclusion, let’s check how common errors actually are:

# How rare are non-zero error codes in the per-star data?
# This divides the count of stars with a non-zero (averaged) error code
# by the count of stars with a zero one, expressed in percent.

error_percent = 100*(sum(mastar_stars$V_ERRCODE_VISIT!=0)/sum(mastar_stars$V_ERRCODE_VISIT==0))

error_percent

## [1] 0.1195334

So errors are rare: stars with a non-zero error code amount to only about 0.12% of those without one, which corresponds to 29 stars against 24,261. With so few errors, the low R-squared of the V_ERRCODE_VISIT model is not surprising, and being able to explain even a small amount of that variation can be useful.
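
Note that the expression above divides the number of error stars by the number of non-error stars, which is an odds-style ratio rather than a share of all stars. For rare events the two are nearly equal. A small sketch, assuming counts of 29 and 24,261, which are consistent with the reported 0.1195334 and the 24,290 rows (the counts are inferred, not read from the data):

```r
n_error <- 29        # inferred count of stars with a non-zero error code
n_clean <- 24261     # inferred count of stars with a zero error code

ratio_pct <- 100 * n_error / n_clean               # errors relative to non-errors
share_pct <- 100 * n_error / (n_error + n_clean)   # errors as a share of all stars

cat("ratio (percent):", round(ratio_pct, 7), "\n")
cat("share (percent):", round(share_pct, 7), "\n")
```

On the real data, the share of all stars could be computed as `100 * mean(mastar_stars$V_ERRCODE_VISIT != 0)`.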

Model Assumptions and Diagnostics

Next let’s talk about the model diagnostics.

# As an example, take the most significant predictor of VERR_VISIT, our best-explained variable
plot(mastar_stars$HELIOV, mastar_stars$VERR_VISIT,
     xlab="HELIOV", ylab="VERR_VISIT", main="Heliocentric Velocity vs Errors")
# abline() needs a one-predictor fit; the multi-predictor model would draw a misleading line
abline(lm(VERR_VISIT ~ HELIOV, data = mastar_stars), col=1, lwd=2)

Looking at the plot of heliocentric velocity against VERR_VISIT, there is something of a small pattern. The data seem to suggest that the velocity error grows as heliocentric velocity nears 0 and again around -200, with many VERR_VISIT values above 10. Let’s look at the residuals next.

# Investigate potential time trend
plot(resid(verr), type="b", main="VERR Residuals vs Order", ylab="Residuals")
abline(h=0, lty=2)

plot(resid(verrcode), type="b", main="V_ERRCODE_VISIT Residuals vs Order", ylab="Residuals")
abline(h=0, lty=2)

plot(resid(verr_visit), type="b", main="VERR_VISIT Residuals vs Order", ylab="Residuals")
abline(h=0, lty=2)

# Residual Plotting
par(mfrow=c(2,2)); plot(verr); par(mfrow=c(1,1))

par(mfrow=c(2,2)); plot(verrcode); par(mfrow=c(1,1))

par(mfrow=c(2,2)); plot(verr_visit); par(mfrow=c(1,1))

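From here we can see no obvious trend in the residuals across row order, so we could go back and remove MJD from the models where needed. Two caveats apply: the rows of mastar_stars are ordered by MANGAID rather than by date, so this is only an indirect check for a time trend, and MJD was significant at the 0.001 level in the VERR and VERR_VISIT models. If we remove that variable from all of the models, we should expect different Residuals vs Order graphs.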
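
The rows of mastar_stars are ordered by MANGAID (a result of group_by() and summarize()), so residuals against row order are not the same as residuals against time. One way to look for a time trend more directly is to sort the residuals by observation date first. A rough sketch on simulated data follows. The variables mjd, x, and y are made up, and the lag-1 correlation is only one simple summary among several possible.

```r
set.seed(1)
n   <- 500
mjd <- runif(n, 56739, 59085)   # hypothetical observation dates
x   <- rnorm(n)
y   <- 1 + 0.5 * x + rnorm(n)   # no time dependence built in

fit <- lm(y ~ x)

# Reorder residuals chronologically instead of by row index.
res_by_time <- resid(fit)[order(mjd)]

# Correlation between each residual and the next one in time;
# values near 0 suggest no serial dependence.
lag1 <- cor(res_by_time[-1], res_by_time[-n])
cat("lag-1 correlation:", round(lag1, 3), "\n")
```

The same reordering could be applied to `resid(verr)` using the MJD column of mastar_stars before plotting.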

Additionally, the VERR_VISIT diagnostic plots look the best of the three, with the points mostly conforming to the reference lines and large clumps in the middle where the majority of observations occur. The remaining deviations could be flattened by removing outliers, but characterizing the outliers is the aim of this project.
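
In the VERR_VISIT summaries, HELIOV and HELIOV_VISIT have coefficients of nearly equal size and opposite sign (about -0.18 and 0.18). That pattern can arise when two predictors are highly correlated and the response tracks their difference; whether this holds for MaStar is not verified here. A sketch on simulated data, where every variable and coefficient is an assumption made for illustration:

```r
set.seed(7)
n            <- 2000
heliov       <- rnorm(n, mean = 0, sd = 100)         # hypothetical per-star velocity
heliov_visit <- heliov + rnorm(n, mean = 0, sd = 2)  # visit value close to it
# Response built to depend only on the difference between the two.
y <- 1 + 0.2 * (heliov_visit - heliov) + rnorm(n)

cat("correlation:", round(cor(heliov, heliov_visit), 4), "\n")

fit_both <- lm(y ~ heliov + heliov_visit)
fit_diff <- lm(y ~ I(heliov_visit - heliov))

print(round(coef(fit_both), 3))   # two slopes of opposite sign
print(round(coef(fit_diff), 3))   # one slope on the difference
```

If the real columns behave this way, a single predictor built from the difference would be easier to interpret than the two separate terms.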

Conclusion and Future Direction

Now that we have three models with varying accuracy, we should choose one to build on. My choice is the VERR_VISIT model, as it is based on the error from each visit to the star by the observatory. Understanding that the VERR and V_ERRCODE_VISIT variables are sparsely populated, thanks to good engineering and good astronomers operating the telescope, VERR_VISIT is the best shot we have at characterizing the errors themselves.

From this investigation, we can use and refine the models to get a sharper picture of what causes VERR_VISIT to change. Our data suggest that the main driver is the heliocentric velocity, and that the survey has more difficulty at heliocentric velocities near 0 and -200 km/s.
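
As one possible refinement, a 0/1 outcome such as an error flag is often modeled with logistic regression rather than lm(). This swaps in a different technique from the one used above. A minimal sketch on simulated data, where the predictor x, the coefficients, and the flag are all invented for illustration:

```r
set.seed(123)
n    <- 5000
x    <- rnorm(n)
# Rare event: probability of about 2% at x = 0, rising with x.
p    <- plogis(-4 + 1 * x)
flag <- rbinom(n, size = 1, prob = p)

fit_logit <- glm(flag ~ x, family = binomial)

cat("share flagged:", round(mean(flag), 4), "\n")
print(round(coef(summary(fit_logit)), 3))
```

The fitted coefficients are on the log-odds scale, so they are not directly comparable to the lm() estimates reported earlier.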

References

Funding for the Sloan Digital Sky Survey V has been provided by the Alfred P. Sloan Foundation, the Heising-Simons Foundation, the National Science Foundation, and the Participating Institutions. SDSS acknowledges support and resources from the Center for High-Performance Computing at the University of Utah. SDSS telescopes are located at Apache Point Observatory, funded by the Astrophysical Research Consortium and operated by New Mexico State University, and at Las Campanas Observatory, operated by the Carnegie Institution for Science. The SDSS web site is www.sdss.org.

SDSS is managed by the Astrophysical Research Consortium for the Participating Institutions of the SDSS Collaboration, including the Carnegie Institution for Science, Chilean National Time Allocation Committee (CNTAC) ratified researchers, Caltech, the Gotham Participation Group, Harvard University, Heidelberg University, The Flatiron Institute, The Johns Hopkins University, L’Ecole polytechnique fédérale de Lausanne (EPFL), Leibniz-Institut für Astrophysik Potsdam (AIP), Max-Planck-Institut für Astronomie (MPIA Heidelberg), Max-Planck-Institut für Extraterrestrische Physik (MPE), Nanjing University, National Astronomical Observatories of China (NAOC), New Mexico State University, The Ohio State University, Pennsylvania State University, Smithsonian Astrophysical Observatory, Space Telescope Science Institute (STScI), the Stellar Astrophysics Participation Group, Universidad Nacional Autónoma de México (UNAM), University of Arizona, University of Colorado Boulder, University of Illinois at Urbana-Champaign, University of Toronto, University of Utah, University of Virginia, Yale University, and Yunnan University.

Characterization of MaStar Visits Using Machine Learning in R

Rei Johnson

2026-04-30