DSLabs Stars Dataset Project

Author

Adi Ve

image credit: Koatuo on HoYoLAB

Load Libraries

library(tidyverse)
library(dslabs)
library(scales)
library(ggplot2)
library(plotly)
library(RColorBrewer)
data("stars")

Introduction

The Hertzsprung-Russell diagram is one of the most well known data visualizations in the field of astronomy and astrophysics. In fact, I recall being instructed on how to construct one in MCPS 8th grade physical science as a project. I had the opportunity earlier this year to take ASTRO100 from the University of Auckland, where I finally learned what these diagrams meant.

This diagram visualizes how most stars pass through their life cycle on the main sequence, and the differing life cycles involve becoming a giant or white dwarf. If I remember correctly, low mass stars often become red giants at the end of their lives, and then shed their outer layers in a planetary nebula, becoming white dwarfs.

With the stars dataset, I noticed that even though the only quantitative variables we have are temperature and magnitude, we have all the data we need to create a Hertzsprung-Russell diagram, as many values such as luminosity and radius can be determined directly from formulae using the magnitude and temperature. Much of this exploration will be technical, and I will refer to all sources as I go along, and do my best to comment on what the code is doing.

image credit: gstraub on shutterstock

Create Reverse Log Transformation

The code for this transformation is taken line for line from Brian Diggs on this stackoverflow thread:
https://stackoverflow.com/questions/11053899/how-to-get-a-reversed-log10-scale-in-ggplot2

reverselog_trans <- function(base = exp(1)) {
    trans <- function(x) -log(x, base)
    inv <- function(x) base^(-x)
    trans_new(paste0("reverselog-", format(base)), trans, inv, 
              log_breaks(base = base), 
              domain = c(1e-100, Inf))
}

Clustering

Honestly, the only thing I know about K-means clustering is from what professor Saidi said in class that it is somehow based on a sort of Euclidean distance in the plane. We try it out to see if it gives anything useful:

stars_cluster <- select(stars, "magnitude", "temp")   # kmeans only wants your
                                                      # quantitative variables

cls <- kmeans(x = stars_cluster, centers = 3)         # split into three groups
stars_cluster$cluster <- as.character(cls$cluster)    # using magic
head(stars_cluster)
  magnitude temp cluster
1       4.8 5840       1
2       1.4 9620       3
3      -3.1 7400       3
4      -0.4 4590       1
5       4.3 5840       1
6       0.5 9900       3
HR_scatter <- ggplot(stars_cluster, aes(x=temp, y=magnitude, color=cluster)) +
  geom_point() +
  scale_y_reverse(breaks = seq(-10, 20, 5)) +         # invert the y axis
  scale_x_continuous(trans=reverselog_trans(10))      # invert and log the x
                                                      # axis

HR_scatter

Clearly the clustering is not the same as the one we want. In fact, some of the groups are clearly extremely far from each other, which confuses me because it honestly just looks like these were bucketed based on their x-axis values.

Instead, we manually classify the stars referencing the groupings on the Hertzsprung Russell diagram from earlier:

HR_scatter <- ggplot(stars, aes(x=temp, y=magnitude)) +
  geom_point() +
  scale_y_reverse(breaks = seq(-10, 20, 5)) +
  scale_x_continuous(trans=reverselog_trans(10))

HR_scatter

# Decide the break points visually by referring to the Hertzsprung-Russell
# diagram reference.

stars_cluster <- mutate(stars, cluster=rep(0,96)) # create a new variable
                                                  # "cluster" with all 0s.

stars_cluster$cluster[(stars$temp>8000)&(stars$magnitude>10)] <- "White Dwarf"
# Give the four mid-temperature low magnitude stars the identifier white dwarf.

stars_cluster$cluster[(stars$temp<6000)&(stars$magnitude<2)
                      &(stars$magnitude>-3)] <- "Red Giant"
# Give the tightly packed cluster lying above the main line the identifier
# red giant.

for (i in 1:96){
  if (stars$temp[i]<15000 & stars$magnitude[i]<(-3)){
    stars_cluster$cluster[i] <- "Supergiant"
  }
}
# Give the cloud of stars above the red giants the identifier supergiant.
# This line was misbehaving really badly and messing up all my data when I
# tried using logical indexing, so I decided to just switch to a for loop,
# which while longer to write is still just as effective.

stars_cluster$cluster[stars_cluster$cluster==0] <- "Main Sequence"
# Give the remaining stars on the main line the identifier main sequence.

Radii of stars

I thought it might be cool to incorporate size into the diagram the way that they did in the one at the top of the document. To do this we need to calculate the radii of the stars, which involves just a couple physical formulae, which I referenced here:
- https://sciencing.com/what-do-stars-look-like-4587447.html
- https://youtu.be/HVJ7yMgsj3s?si=UzO_uWkeigxP02Rt
- https://en.wikipedia.org/wiki/Stefan%E2%80%93Boltzmann_law

sigma = 5.670*10^(-8) # Stefan-Boltzmann Constant
k = 1/(2*sqrt(pi*sigma)) # Reduced Stefan-Boltzmann Constant
L0 = 3.0128*10^(28) # Zero-point luminosity

# Create a function which will give the luminosity using the magnitudes
lum_from_mag <- function(M){
  L = L0*(10^(-M/2.512))
  return(L)
}

# add the luminosity as a variable
stars_cluster <- mutate(stars_cluster, luminosity=lum_from_mag(magnitude))

# add the radius as a new variable using the luminosity function
stars_cluster <- mutate(stars_cluster,
                        radius = k*sqrt(luminosity)/(temp^2))

Hertzsprung-Russell Diagram

You can use the function “brewer.pal()” to get the hex codes from a palette.

Line by line, I do the following: - assign a ggplot to a variable HR_diagram using the stars_cluster dataframe which I generated from the original stars dataset, and add aesthetics temperature for the x axis, magnitude for the y axis, cluster (classification) for the color, and the name of the star for the name of each entry. - add a point geom to the ggplot to generate the scatterplot. - set the y axis to be inverted, ranging from -10 to 20 in labeled increments of 5. - set the x axis to be continuous with manually entered increments, and apply the negative log base 10 transformation to the axis. - add the colors manually based on the order in which I previewed the clusters would appear, assigning a yellow color to the main sequence, a red color for the red giants, a blue color for the supergiants, and a green color for the white dwarfs. - give the axes labels, and add a title. the subtitle and caption were both being omitted by plotly so I decided not to include them. - change the theme to classic. - adjust the title text to be in the center. this line must come after the theme assignment because that will override this command. - call the plot within the ggplotly function, which adds interactivity to the scatterplot so the viewer can see which star is where.

HR_diagram <- ggplot(stars_cluster, aes(x=temp, y=magnitude, color=cluster,
                                        size=(radius), name=star)) +
  geom_point() +
  scale_y_reverse(breaks = seq(-10, 20, 5)) +
  scale_x_continuous(breaks = c(3000, 6000, 10000, 15000, 22000, 30000, 40000), 
                     trans=reverselog_trans(10)) +
  scale_color_manual(values = c("#FDAE61", "#D53E4F", "#3288BD", "#66C2A5")) +
  labs(x = "Temperature in °K (log scale)", y = "Absolute Magnitude",
       title = "dslabs Hertzsprung-Russell Diagram") +
  theme_classic() +
  theme(plot.title = element_text(hjust = 0.5))
  

ggplotly(HR_diagram)

Discussion

Looks like a pretty classic HR diagram!! I had an error initially with the radius computations where I accidentally applied the square root around the temperature as well, but after fixing that the sizes became a lot more reasonable. I really like how this visual distinguishes between the classes of stars, and thereby indicates the relationship between temperature and magnitude for each. One thing that is a bit misleading about this diagram are the colors, because most of these supergiants are actually reddish in color, but I wanted them to be easily identifiable by color. One thing I think I could do if I were to do this again is encode temperature using a gradient for each one, perhaps distinguishing the classifications by saturation or using slightly different palettes. A cool idea would be to map these temperatures to their blackbody radiation frequencies, and using a function encode the hex code of the resulting color.

I want to note, it appears that there are major errors with the DS Labs’ stars dataset. This first became apparent to me when I was observing the radius of Canopus which was generated using the data from this dataset, and the value was wildly off. Canopus in reality has a radius of roughly 4.9e+10 meters. This data produces that Canopus has a radius of roughly 1.5e+10 meters. The major source of error, I found, was that the dataset indicates that Canopus has an absolute magnitude of -3.1, when in fact, Canopus has an absolute magnitude of -5.7; this is a major discrepancy, because magnitude is a logarithmic measure. In ASTRO100, I was taught that a difference of 5 in magnitude is roughly a 100 times difference in brightness. This doesn’t seem to be a single-point error, as Delta Canis Majoris in this dataset is indicated to have an absolute magnitude of -8.0, where in reality it has an absolute magnitude of -6.9. Perhaps this is just for learning purposes, to illustrate that there is inherent error in measurement, or something of the sort, but it makes the information from this visualization cursory at best, not to be taken with any sort of precision. These values are taken from Wikipedia.

(edit: there are also huge issues with the temperature variable, e.g. Van Maanen’s star has a very accurate magnitude but a horribly discrepant temperature, differing by about a factor of 2 from reality (13000 vs 6100). this isn’t a problem with my cleaning or manipulations either, because these are present in the original stars dataset as well.)

References

Coding references (I forgot all the syntax… I will never remember it):
- https://www.dataquest.io/blog/write-functions-in-r/
- https://www.w3schools.com/r/r_for_loop.asp
- https://www.geeksforgeeks.org/r-if-statement/
- https://www.statology.org/r-create-vector-of-zeros/
- https://datatofish.com/replace-values-dataframe-r/
- http://www.sthda.com/english/wiki/ggplot2-colors-how-to-change-colors-automatically-and-manually
- https://rpubs.com/aephidayatuloh/clustervisual
- https://stackoverflow.com/questions/11053899/how-to-get-a-reversed-log10-scale-in-ggplot2
- https://rstudio.github.io/cheatsheets/data-visualization.pdf
- https://themockup.blog/posts/2020-08-28-heatmaps-in-ggplot2/index.html
- https://plotly.com/r/hover-text-and-formatting/

Data References:
- https://cran.r-project.org/web/packages/dslabs/dslabs.pdf

Physics References:
- https://astronomy.stackexchange.com/questions/33065/why-is-the-suns-luminosity-not-equal-to-the-zero-point-luminosity
- https://en.wikipedia.org/wiki/Stefan%E2%80%93Boltzmann_law
- https://youtu.be/HVJ7yMgsj3s?si=UzO_uWkeigxP02Rt
- https://sciencing.com/what-do-stars-look-like-4587447.html
- https://www.azoquantum.com/Article.aspx?ArticleID=436
- https://starparty.com/topics/astronomy/stars/star-types/