This assignment will use the following packages:
library(readr)
library(RCurl)
library(stringr)
library(dplyr)
library(tidyr)
library(tidyverse)
library(ggplot2)
library(knitr)
library(kableExtra)
Astronomy has played a key role in advancement of human civilization and to understand our universe. Exoplanets, planets not apart of our system, have been an interest of astronomers for centuries, which may indicate potential possibilities of extraterrestrial life. Gaia, which was launched earlier plays a key role in advancements in Astronomy as it is used to analyze the stars of our universe and for other Astronomy missions.
The following project will utilize three datasets I have obtained from the Sloan Digital Sky survey data and Gaia. The first dataset will be exoplanet dataset from Sloan Digital Sky Survey, the second and third dataset will be from Gaia Archive.
The following datasets are extracted from my Github Page. This will convert them into data frames.
rawgaia <- getURL("https://raw.githubusercontent.com/spacerome/Data607_Project_2/refs/heads/main/gaiadata.csv")
rawgaia2 <- getURL("https://raw.githubusercontent.com/spacerome/Data607_Project_2/refs/heads/main/gaiadata2.csv")
rawsds <- getURL("https://raw.githubusercontent.com/spacerome/Data607_Project_2/refs/heads/main/PS_2024.10.06_14.59.24.csv")
gaiadf <- data.frame(read.csv(text=rawgaia, sep= "\t", stringsAsFactors = FALSE, check.names = FALSE))
gaiadf2 <- data.frame(read.csv(text=rawgaia2, sep= "\t", stringsAsFactors = FALSE, check.names = FALSE))
sdsdf <- data.frame(read.csv(text=rawsds, sep= "\t", stringsAsFactors = FALSE, check.names = FALSE))
This code block will tidy the following dataset from
gaiadf
and gaiadf2
respectively:
gaiadf_tidy <- gaiadf %>%
pivot_longer(
cols = -SOURCE_ID,
names_to = c("measurement_type", "band", "replicate"),
names_pattern = "(\\w+)_(\\w+)_(\\d)",
values_to = "value"
) %>%
filter(!is.na(value))
head(gaiadf_tidy)
## # A tibble: 6 × 5
## SOURCE_ID measurement_type band replicate value
## <dbl> <chr> <chr> <chr> <dbl>
## 1 4.04e18 dec bp 1 -34.4
## 2 4.04e18 dec bp 2 -34.4
## 3 4.04e18 dec g 1 -34.4
## 4 4.04e18 dec g 2 -34.4
## 5 4.04e18 dec rp 1 -34.4
## 6 4.04e18 dec rp 2 -34.4
gaiadf2_tidy <- gaiadf2 %>%
pivot_longer(
cols = -source_id,
names_to = "attribute",
values_to = "value"
) %>%
filter(!is.na(value))
head(gaiadf2_tidy)
## # A tibble: 6 × 3
## source_id attribute value
## <dbl> <chr> <dbl>
## 1 4.19e17 ra 10.1
## 2 4.19e17 dec 56.5
## 3 4.19e17 parallax 14.1
## 4 4.19e17 phot_g_mean_mag 1.94
## 5 4.19e17 phot_bp_mean_mag 2.99
## 6 4.19e17 phot_rp_mean_mag 1.84
We will analyze the data from gaiadf_tidy
first then
will analyze gaiadf2_tidy
.
This will basically calculate the summary statistics for each
measurement such as dec (declination), ra (right ascension), magnitude,
parallax for each band rp
, bp
, and
g
. The bands are defined as bp
is Blue
Photometer, g
is Gaia’s main photometric band, and
rp
is the Red Photometer.
gaiadf_tidy %>%
group_by(measurement_type, band) %>%
summarize(
mean_value = mean(as.numeric(value), na.rm = TRUE),
sd_value = sd(as.numeric(value), na.rm = TRUE),
min_value = min(as.numeric(value), na.rm = TRUE),
max_value = max(as.numeric(value), na.rm = TRUE),
count = n()
) %>%
arrange(measurement_type, band)
## `summarise()` has grouped output by 'measurement_type'. You can override using
## the `.groups` argument.
## # A tibble: 12 × 7
## # Groups: measurement_type [4]
## measurement_type band mean_value sd_value min_value max_value count
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <int>
## 1 dec bp -66.2 2.69 -70.2 -34.4 10000
## 2 dec g -66.2 2.69 -70.2 -34.4 10000
## 3 dec rp -66.2 2.69 -70.2 -34.4 10000
## 4 magnitude bp 19.2 2.09 7.61 22.6 9928
## 5 magnitude g 18.5 1.99 7.28 21.1 9998
## 6 magnitude rp 17.7 1.90 5.93 20.8 9930
## 7 parallax bp 1.01 1.39 0.000521 61.2 10000
## 8 parallax g 1.01 1.39 0.000521 61.2 10000
## 9 parallax rp 1.01 1.39 0.000521 61.2 10000
## 10 ra bp 58.1 9.89 50.7 274. 10000
## 11 ra g 58.1 9.89 50.7 274. 10000
## 12 ra rp 58.1 9.89 50.7 274. 10000
Findings: From this we can tell the mean declination
value is consistent across the all three bands, with a mean of
approximately -66.2
and a standard deviation of about
2.69
. The Magnitude values slightly differ across the
bands, with bp
having the tightest mean magnitude of
19.2
, followed by g
with a mean magnitude of
18.5
, and rp
has a mean magnitude of
17.7
. This is expected as the Blue Photometer band is
usually more sensitive to fainter objects. The mean parallax value is
1.01
for all bands, which indicates that the stars in
gaiadf_tidy
have similar parallax values. The high standard
deviation of 1.39
means that there are a broad range of
distances which is expected. Lastly, for right ascension is consistent
across all bansds with a mean of 58.1
, and a high standard
deviation of 9.8
suggests a wider spread of values.
ggplot(gaiadf_tidy, aes(x = band, y = as.numeric(value), fill = band)) +
geom_boxplot() +
facet_wrap(~ measurement_type, scales = "free_y") +
labs(title = "Distribution of Measurements by Band and Type",
y = "Value", x = "Band") +
theme_minimal() +
theme(
plot.title = element_text(hjust = 0.5)
)
Findings: For the declination, the distribution for
the mean declination value is around -66
for all three
bands. The spread and whiskers are similar across all bands, indicating
that the declination measurements are consistent across the different
bands. For the magnitude, the values vary slightly across the bands,
with rp
having a slightly lower of about 17.7
,
and the g
band is about 18.5
. The spread is
wider in the rp
band compared to bp
and
g
. The rp
band also shows more outliers, which
indicates a broader range of brightness levels for stars in the red
band. For the parallax, the distribution is very tight and near zero for
all bands, which means most of the stars are far away and have very
small parallax values (hence parallax’s limitations). The spread of the
box is minimal which shows that means that most of the stars have a
similar parallax value. Analyzing the right ascension from the values
indicates that values for ra
are distributed more widely
across all three bands and centered around 58
. There is a
larger spread of ra
values where g
band shows
a slightly wider distribution compared to the bp
and
rp
bands.
gaiadf2_tidy %>%
group_by(attribute) %>%
summarize(
mean_value = mean(as.numeric(value), na.rm = TRUE),
sd_value = sd(as.numeric(value), na.rm = TRUE),
min_value = min(as.numeric(value), na.rm = TRUE),
max_value = max(as.numeric(value), na.rm = TRUE),
count = n()
) %>%
arrange(attribute)
## # A tibble: 8 × 6
## attribute mean_value sd_value min_value max_value count
## <chr> <dbl> <dbl> <dbl> <dbl> <int>
## 1 dec -3.43 40.4 -83.7 87.0 1000
## 2 parallax 18.3 26.1 0.119 311. 1000
## 3 phot_bp_mean_mag 4.40 0.707 2.88 8.05 1000
## 4 phot_g_mean_mag 3.80 0.534 1.94 4.44 1000
## 5 phot_rp_mean_mag 3.28 0.610 1.84 4.59 1000
## 6 pmdec -27.1 195. -3422. 1165. 1000
## 7 pmra 5.69 243. -2240. 3967. 1000
## 8 ra 183. 101. 0.398 360. 1000
Findings: The Positional Attributes of Right
Ascension and Declination show that ra
ranges from
0.398
to 360
degrees, which covers the entire
sky, and also means gaiadf2_tidy
covers the full range of
the celestial sphere, and dec
ranges from
-83.7
to 87.0
degrees, which shows that the
stars in gaiadf2_tidy
spread across both hemispheres, which
covers a broad range of the sky, with an average value close to zero, a
mean of -3.43
. This data suggests that there is a nearly
even distribution of stars above and below the celestial equator. The
Magnitude for the bp
band has a mean of 4.4
and a slightly higher standard deviation of 0.707
, which
indicates a moderate range of brightness values in this band. The
g
band has a mean magnitude of 3.8
which has a
standard deviation of 0.534
and indicates that most of the
stars are consistently bright in the g
band. The
rp
band has a mean magnitude of 3.28
which
indicates that the stars appear brightest in the red band on average.
The standard deviation of the rp
band is 0.610
which indicates a wider spread of brightness in the rp
band. The difference in mean magnitudes is expected as the sensitivity
varies within the three bands (you will see a lot of this in
observational astronomy and in astronomical data). Also stars are
fainter in the bp
band and brighter in the rp
band, since the bp
band is more sensitive to bluer light,
and this is emitted less by most stars compared to redder light, and
also since the universe is expanding light tends to become more
red-shifted as the distance increases. To explain the proper motion
values, the mean of pmra
, which is 5.69
, and
the pmdec
is -27.1
, this suggests there is a
slight bias in motion towards the west and towards the south. This may
indicate a systemic trend in this data, where the distribution of the
stars’ velocities are relative to the Earth’s motion. The parallax
values indicate a mean parallax of 18.3
which if you want
to calculate the distance (1/parallax
) you will get
0.055
parsecs (0.18 light years) which indicates very close
stars. The standard deviation is quite large which is 26.1
and the range of parallax values is 0.119
to
311
, which shows that the stars vary at different
distances, and the high parallax values indicates that there is a
presence of very nearby stars, which are potentially within our solar
neighborhood.
ggplot(gaiadf2_tidy, aes(x = as.numeric(value), fill = attribute)) +
geom_histogram(binwidth = 0.5, color = "black", alpha = 0.6) +
facet_wrap(~ attribute, scales = "free") +
labs(title = "Histogram of Attributes in gaiadf2_tidy",
x = "Value", y = "Frequency") +
theme_minimal() +
theme(
plot.title = element_text(hjust = 0.5))
Findings: To summarize this, the dec
part suggests the distribution is relatively uniform as mentioned
before. This suggests that the stars are evenly distributed across the
northern and southern hemispheres. The ra
is uniformly
distributed across the entire range of 0
to
360
degrees, which confirms that the stars are spread
evenly across the sky in ra
. The mean magnitudes are
showing near-normal distributions and as mentioned previously it
supports the explanation that rp
has the brightest stars.
The parallax is highly right-skewed with most values clustered near zero
and a few extreme values up to 300
. This indicates most
stars are far away, and very few close stars having high parallax
values. The Proper motion (pmra
and pmdec
)
show it is right-skewed with large tails, which indicates a few stars
with very high proper motion, and the peaks near zero suggests that most
stars have a low proper motion, but there are
significant outliers.
# Scatter plot of Proper Motion in RA vs Dec
gaiadf2_pm <- gaiadf2_tidy %>%
filter(attribute %in% c("pmra", "pmdec")) %>%
pivot_wider(names_from = attribute, values_from = value)
ggplot(gaiadf2_pm, aes(x = as.numeric(pmra), y = as.numeric(pmdec))) +
geom_point(alpha = 0.6, color = "red") +
labs(title = "Proper Motion: PMRA vs PMDEC",
x = "Proper Motion in RA", y = "Proper Motion in Dec") +
theme_minimal() +
theme(
plot.title = element_text(hjust = 0.5))
Findings: There is a cluster around the origin of
(0,0)
for pmra
and pmdec
which
indicates that most stars have a relatively low proper motion in both
directions. This is mostly because most of the stars in the sky are
moving very slowly relatively to the position of our planet (can go into
further detail if needed). There are outliers here where stars are
scattered far away from the central cluster, which indicates very high
proper motion values in either the right ascension, declination, or
both. These outliers can represent stars that are moving rapidly across
the sky, which may be nearby high-velocity stars, or binary systems.
parallax_magnitude <- gaiadf2_tidy %>%
filter(attribute %in% c("parallax", "phot_g_mean_mag")) %>%
pivot_wider(names_from = attribute, values_from = value)
ggplot(parallax_magnitude, aes(x = as.numeric(parallax), y = as.numeric(phot_g_mean_mag))) +
geom_point(color = "purple", alpha = 0.6) +
labs(title = "Parallax vs G-band Magnitude",
x = "Parallax", y = "G-band Magnitude") +
theme_minimal() +
theme(
plot.title = element_text(hjust = 0.5))
Findings: This scatterplot where it compares the parallax of stars vs the G-Band Magnitude indicates that most of the stars are distant (low parallax) as they are mostly clustered close to zero, and have similar magnitudes in the g-band. There are a few closer stars (high parallax) that appear much brighter. This pattern is consistent with the explanation that stars become fainter with increasing distance.
To write this data into a csv for reproducibility the file will be exported here:
write.csv(gaiadf_tidy, "gaiadf_tidy_analysis.csv", row.names = FALSE)
write.csv(gaiadf2_tidy, "gaiadf2_tidy_analysis.csv", row.names = FALSE)
Similar to with the gaia datasets we will tidy the Sloan Digital Sky survey data into tidy format.
id_columns <- c("pl_name", "hostname", "discoverymethod", "disc_year", "soltype", "pl_controv_flag", "disc_facility")
sdsdf_filtered <- sdsdf %>%
select(all_of(id_columns), starts_with("sy_"))
sdsdf_tidy <- sdsdf_filtered %>%
pivot_longer(
cols = starts_with("sy_"),
names_to = c("type", ".value"),
names_sep = "_",
values_drop_na = TRUE
)
sdsdf_tidy <- sdsdf_tidy %>%
filter(!is.na(vmag) & !is.na(kmag) & !is.na(gaiamag))
We will begin analyzing the dataset with the following code blocks:
sdsdf_tidy %>%
group_by(type) %>%
summarise(
mean_vmag = mean(vmag, na.rm = TRUE),
median_vmag = median(vmag, na.rm = TRUE),
sd_vmag = sd(vmag, na.rm = TRUE),
mean_kmag = mean(kmag, na.rm = TRUE),
median_kmag = median(kmag, na.rm = TRUE),
sd_kmag = sd(kmag, na.rm = TRUE),
mean_gaiamag = mean(gaiamag, na.rm = TRUE),
median_gaiamag = median(gaiamag, na.rm = TRUE),
sd_gaiamag = sd(gaiamag, na.rm = TRUE)
) %>%
mutate(
percent_mean_vmag = mean_vmag / sum(mean_vmag, na.rm = TRUE) * 100,
percent_mean_kmag = mean_kmag / sum(mean_kmag, na.rm = TRUE) * 100,
percent_mean_gaiamag = mean_gaiamag / sum(mean_gaiamag, na.rm = TRUE) * 100
)
## # A tibble: 1 × 13
## type mean_vmag median_vmag sd_vmag mean_kmag median_kmag sd_kmag mean_gaiamag
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 sy 13.6 14.2 2.39 11.6 12.3 2.31 13.4
## # ℹ 5 more variables: median_gaiamag <dbl>, sd_gaiamag <dbl>,
## # percent_mean_vmag <dbl>, percent_mean_kmag <dbl>,
## # percent_mean_gaiamag <dbl>
sdsdf_tidy %>%
count(type) %>%
mutate(percent = n / sum(n) * 100)
## # A tibble: 1 × 3
## type n percent
## <chr> <int> <dbl>
## 1 sy 35356 100
ggplot(sdsdf_tidy, aes(x = vmag, fill = type)) +
geom_histogram(aes(y = ..count.. / sum(..count..) * 100), bins = 30, alpha = 0.6) +
facet_wrap(~ type) +
theme_minimal() +
labs(title = "Distribution of Visual Magnitude (Vmag) for Each Type", x = "Visual Magnitude", y = "Percentage (%)") +
theme(
plot.title = element_text(hjust = 0.5)
)
## Warning: The dot-dot notation (`..count..`) was deprecated in ggplot2 3.4.0.
## ℹ Please use `after_stat(count)` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
Findings: This states that most of the visual
magnitudes range from 12-15, which suggests that sdsdf_tidy
mainly contains fainter celestial objects. There is a significant drop
for Vmag
values less than 10, which indicates a limited
number of brighter objects and the long tail indicates a frew extremely
faint objects.
ggplot(sdsdf_tidy, aes(x = vmag, y = gaiamag, color = type)) +
geom_point() +
geom_smooth(method = "lm") +
theme_minimal() +
labs(title = "Relationship between Visual Magnitude (Vmag) and Gaia Magnitude", x = "Visual Magnitude", y = "Gaia Magnitude") +
theme(
plot.title = element_text(hjust = 0.5)
)
## `geom_smooth()` using formula = 'y ~ x'
Findings: From this relationship between the visual
magnitude and the gaia magnitude, there is a high correlation between
the Vmag
and `Gaia Magnitude, where as the visual magnitude
increases, the gaia magnitude also increases. Thus objects that are
brighter in the visible spectrum tend to be brighter in the Gaia
band.
sdsdf_tidy %>%
group_by(discoverymethod) %>%
summarise(
n = n(), # Count of observations
mean_vmag = mean(vmag, na.rm = TRUE),
sd_vmag = sd(vmag, na.rm = TRUE),
mean_kmag = mean(kmag, na.rm = TRUE),
sd_kmag = sd(kmag, na.rm = TRUE)
) %>%
mutate(
percent_n = n / sum(n) * 100, # Percentage of total count
percent_mean_vmag = mean_vmag / sum(mean_vmag) * 100,
percent_mean_kmag = mean_kmag / sum(mean_kmag) * 100
) %>%
select(discoverymethod, n, percent_n, mean_vmag, percent_mean_vmag, mean_kmag, percent_mean_kmag)
## # A tibble: 10 × 7
## discoverymethod n percent_n mean_vmag percent_mean_vmag mean_kmag
## <chr> <int> <dbl> <dbl> <dbl> <dbl>
## 1 Astrometry 2 0.00566 10.2 7.99 6.25
## 2 Disk Kinematics 1 0.00283 8.44 6.65 5.94
## 3 Eclipse Timing Variati… 21 0.0594 14.6 11.5 13.7
## 4 Imaging 117 0.331 9.13 7.19 6.90
## 5 Orbital Brightness Mod… 21 0.0594 13.9 11.0 13.0
## 6 Pulsar Timing 1 0.00283 20.9 16.4 15.6
## 7 Pulsation Timing Varia… 2 0.00566 14.0 11.0 13.7
## 8 Radial Velocity 2519 7.12 7.88 6.20 5.69
## 9 Transit 32534 92.0 14.0 11.1 12.0
## 10 Transit Timing Variati… 138 0.390 14.0 11.0 12.1
## # ℹ 1 more variable: percent_mean_kmag <dbl>
Findings: From this analysis, this shows that the
transit method is really dominant as it shows a large percentage of
observations for detecting exoplanets where the percentage is
92.02%
. This is definitely expected since the transit
method is one of the most effective techniques for detecting exoplanets
when monitoring large numbers of stars. Pulsar Timing and Eclipse timing
variations have high mean magnitudes, which indicates they are typically
used to detect vary faint objects that are harder to observe. The
imaging and radial velocity methods show relatively low mean magnitudes,
which indicates that they are used to detect brighter objects.
Lastly, I will make a time series of the discoveries over time for the exoplanets.
sdsdf_tidy %>%
group_by(disc_year) %>%
summarise(count = n()) %>%
ggplot(aes(x = disc_year, y = count)) +
geom_line() +
theme_minimal() +
labs(title = "Number of Discoveries Over Time", x = "Discovery Year", y = "Number of Discoveries") +
theme(
plot.title = element_text(hjust = 0.5)
)
This graph explains how prior to the 2000s there were very few discoveries before the year 2000, with the number of detections remaining low. From 2000 to 2010, there was an increase in the number of discoveries which were indications that there were improvements in detection technology and techniques in radial velocity and transit methods explained in Gaia’s website and NASA’s interactive timeline. From 2010 to 2016, this had a large number of discoveries stemming from the large-scale survey missions from NASA’s Kepler Space Telescope which uses the transit method. There weas a decline after 2016, which may be from fewer missions, but an increase around 2018 where the launch of TESS was used to analyze exoplanets. COVID also may have been an impact on this time series as well, since this impacted numerous research projects. Overall, this should increase with the launch of JWST (James Webb Space Telescope) and PLATO (Planetary Transits and Oscillations of Stars) from ESA when it launches in 2025.
This exports the file into csv for reproductibility.
write.csv(sdsdf_tidy, "sdsdf_tidy.csv", row.names = FALSE)
With the analysis of the Gaia Datasets, it appears there may need to be further analysis on the anomalies where we may need to classify the stars by distances with respect to the parallax values. This will determine whether the brighter stars (lower magnitude) with high parallax values are part of known nearby stellar systems. We may need to examine unique stars with high parallax and high magnitude to determine whether or not these can be unique stars or errors. Lastly for the gaia datasets, with the concentration of stars at low parallax values may indicate the detection limit of Gaia’s parallax measurements, where stars below a certain parallax threshold are difficult to measure accurately. To continue to investigate the outliers for proper motion with the Gaia datasets, we can analyze the high proper motion stars to check if they are using their parallax values and if they are close it may explain why their motion is rapid across the sky. Similar to the Gaia Datasets, with the Sloan Digital Sky surveys, further analysis may be needed to analyze the outliers to determine if there are any errors and to determine if any of the detections can be false positives which is common in exoplanet detections.