knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)
setwd("~/OneDrive/CUNY GC MS DA+DV/DATA 71000 - Data Analysis Methods/Final Project")
# install.packages("kableExtra")
# install.packages("gplots")
# install.packages('corrplot')
# install.packages("ggstatsplot")
# install.packages("vcd")
# install.packages("ggmosaic")
library(tidyverse)
library(psych)
library(gplots)
library(corrplot)
library(kableExtra)
library(ggstatsplot)
library(vcd)
library(ggmosaic)
options(scipen = 999, digits = 4)
I started this a few days ago and just finished it up even though after today’s class it seems clear that I shouldn’t be using the chi-square test with this data because there are so many small n’s and zeros in my contingency tables. I should probably be using the Fisher’s exact test because my expected counts are less than 5 in many cells.1 Even though that reference says it is only for 2x2 contingency tables, that is just becuase the calculations would be impossible to do by hand for larger tables.2 This is just meant to be a comparison of the different ways to visualize the chi-square test. Some of the visualizations are pretty bad! But they are there for reference so I know what to use and what not to in the future.
Download data from https://exoplanetarchive.ipac.caltech.edu/cgi-bin/TblView/nph-tblView?app=ExoTbls&config=PSCompPars as csv with all columns and all rows.
# Exoplanet Data
col_names <- read_csv("PSCompPars_2023.03.14_17.23.51.csv", skip = 2, n_max = 312)
exo <- read_csv("PSCompPars_2023.03.14_17.23.51.csv", skip = 316)
col_names <- col_names %>%
mutate(var = str_split_i(str_split_i(`#`,":",1), "# COLUMN ", 2),
definition = trimws(str_split_i(`#`,":",2))) %>%
select(var, definition)
exo <- exo %>%
mutate(across(c(discoverymethod, disc_locale, disc_facility, disc_telescope,
disc_instrument), factor),
n_stars = ordered(sy_snum,
levels = sort(unique(exo$sy_snum)),
labels = paste(sort(unique(exo$sy_snum)), "star system")),
n_planets = ordered(sy_pnum,
levels = sort(unique(exo$sy_pnum)),
labels = paste(sort(unique(exo$sy_pnum)), "planet system")))
Most of the code for Chi-Square testing and plotting comes from here: http://www.sthda.com/english/wiki/chi-square-test-of-independence-in-r
sys_planets <- table(exo$discoverymethod, exo$n_planets)
sys_planets %>% as.data.frame.matrix() %>% kbl() %>% kable_minimal()
| 1 planet system | 2 planet system | 3 planet system | 4 planet system | 5 planet system | 6 planet system | 7 planet system | 8 planet system | |
|---|---|---|---|---|---|---|---|---|
| Astrometry | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Disk Kinematics | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Eclipse Timing Variations | 6 | 8 | 3 | 0 | 0 | 0 | 0 | 0 |
| Imaging | 50 | 9 | 0 | 4 | 0 | 0 | 0 | 0 |
| Microlensing | 162 | 14 | 0 | 0 | 0 | 0 | 0 | 0 |
| Orbital Brightness Modulation | 4 | 2 | 3 | 0 | 0 | 0 | 0 | 0 |
| Pulsar Timing | 4 | 0 | 3 | 0 | 0 | 0 | 0 | 0 |
| Pulsation Timing Variations | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Radial Velocity | 524 | 302 | 108 | 49 | 25 | 21 | 0 | 0 |
| Transit | 2329 | 818 | 450 | 211 | 108 | 39 | 7 | 8 |
| Transit Timing Variations | 0 | 15 | 3 | 4 | 2 | 0 | 0 | 0 |
balloonplot(t(sys_planets), main ="", xlab ="", ylab="", text.size=.75,
label.size=1, label.color='cornflowerblue',
label = TRUE, show.margins = TRUE)
chi <- chisq.test(sys_planets)
chi
##
## Pearson's Chi-squared test
##
## data: sys_planets
## X-squared = 223, df = 70, p-value <0.0000000000000002
chi$observed %>% as.data.frame.matrix() %>% kbl() %>% kable_minimal()
| 1 planet system | 2 planet system | 3 planet system | 4 planet system | 5 planet system | 6 planet system | 7 planet system | 8 planet system | |
|---|---|---|---|---|---|---|---|---|
| Astrometry | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Disk Kinematics | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Eclipse Timing Variations | 6 | 8 | 3 | 0 | 0 | 0 | 0 | 0 |
| Imaging | 50 | 9 | 0 | 4 | 0 | 0 | 0 | 0 |
| Microlensing | 162 | 14 | 0 | 0 | 0 | 0 | 0 | 0 |
| Orbital Brightness Modulation | 4 | 2 | 3 | 0 | 0 | 0 | 0 | 0 |
| Pulsar Timing | 4 | 0 | 3 | 0 | 0 | 0 | 0 | 0 |
| Pulsation Timing Variations | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Radial Velocity | 524 | 302 | 108 | 49 | 25 | 21 | 0 | 0 |
| Transit | 2329 | 818 | 450 | 211 | 108 | 39 | 7 | 8 |
| Transit Timing Variations | 0 | 15 | 3 | 4 | 2 | 0 | 0 | 0 |
round(chi$expected,2) %>% as.data.frame.matrix() %>% kbl() %>% kable_minimal()
| 1 planet system | 2 planet system | 3 planet system | 4 planet system | 5 planet system | 6 planet system | 7 planet system | 8 planet system | |
|---|---|---|---|---|---|---|---|---|
| Astrometry | 1.16 | 0.44 | 0.22 | 0.10 | 0.05 | 0.02 | 0.00 | 0.00 |
| Disk Kinematics | 0.58 | 0.22 | 0.11 | 0.05 | 0.03 | 0.01 | 0.00 | 0.00 |
| Eclipse Timing Variations | 9.89 | 3.75 | 1.83 | 0.86 | 0.43 | 0.19 | 0.02 | 0.03 |
| Imaging | 36.66 | 13.88 | 6.78 | 3.19 | 1.60 | 0.71 | 0.08 | 0.10 |
| Microlensing | 102.41 | 38.79 | 18.93 | 8.90 | 4.48 | 1.99 | 0.23 | 0.27 |
| Orbital Brightness Modulation | 5.24 | 1.98 | 0.97 | 0.46 | 0.23 | 0.10 | 0.01 | 0.01 |
| Pulsar Timing | 4.07 | 1.54 | 0.75 | 0.35 | 0.18 | 0.08 | 0.01 | 0.01 |
| Pulsation Timing Variations | 1.16 | 0.44 | 0.22 | 0.10 | 0.05 | 0.02 | 0.00 | 0.00 |
| Radial Velocity | 598.76 | 226.77 | 110.67 | 52.03 | 26.21 | 11.65 | 1.36 | 1.55 |
| Transit | 2310.09 | 874.90 | 426.96 | 200.75 | 101.12 | 44.94 | 5.24 | 5.99 |
| Transit Timing Variations | 13.97 | 5.29 | 2.58 | 1.21 | 0.61 | 0.27 | 0.03 | 0.04 |
round(chi$residuals, 3) %>% as.data.frame.matrix() %>% kbl() %>% kable_minimal()
| 1 planet system | 2 planet system | 3 planet system | 4 planet system | 5 planet system | 6 planet system | 7 planet system | 8 planet system | |
|---|---|---|---|---|---|---|---|---|
| Astrometry | 0.775 | -0.664 | -0.464 | -0.318 | -0.226 | -0.150 | -0.051 | -0.055 |
| Disk Kinematics | 0.548 | -0.469 | -0.328 | -0.225 | -0.160 | -0.106 | -0.036 | -0.039 |
| Eclipse Timing Variations | -1.237 | 2.198 | 0.867 | -0.927 | -0.658 | -0.439 | -0.150 | -0.160 |
| Imaging | 2.203 | -1.311 | -2.603 | 0.456 | -1.267 | -0.845 | -0.288 | -0.308 |
| Microlensing | 5.888 | -3.980 | -4.351 | -2.983 | -2.117 | -1.412 | -0.482 | -0.515 |
| Orbital Brightness Modulation | -0.541 | 0.012 | 2.065 | -0.675 | -0.479 | -0.319 | -0.109 | -0.117 |
| Pulsar Timing | -0.036 | -1.242 | 2.590 | -0.595 | -0.422 | -0.282 | -0.096 | -0.103 |
| Pulsation Timing Variations | 0.775 | -0.664 | -0.464 | -0.318 | -0.226 | -0.150 | -0.051 | -0.055 |
| Radial Velocity | -3.055 | 4.996 | -0.253 | -0.420 | -0.236 | 2.740 | -1.166 | -1.246 |
| Transit | 0.393 | -1.924 | 1.115 | 0.724 | 0.684 | -0.887 | 0.767 | 0.820 |
| Transit Timing Variations | -3.737 | 4.223 | 0.261 | 2.529 | 1.776 | -0.521 | -0.178 | -0.190 |
corrplot(chi$residuals, is.cor = FALSE)
contribution <- 100*chi$residuals^2/chi$statistic
round(contribution, 3) %>% as.data.frame.matrix() %>% kbl() %>% kable_minimal()
| 1 planet system | 2 planet system | 3 planet system | 4 planet system | 5 planet system | 6 planet system | 7 planet system | 8 planet system | |
|---|---|---|---|---|---|---|---|---|
| Astrometry | 0.270 | 0.198 | 0.097 | 0.045 | 0.023 | 0.010 | 0.001 | 0.001 |
| Disk Kinematics | 0.135 | 0.099 | 0.048 | 0.023 | 0.011 | 0.005 | 0.001 | 0.001 |
| Eclipse Timing Variations | 0.687 | 2.168 | 0.337 | 0.386 | 0.194 | 0.086 | 0.010 | 0.012 |
| Imaging | 2.180 | 0.771 | 3.042 | 0.093 | 0.720 | 0.320 | 0.037 | 0.043 |
| Microlensing | 15.565 | 7.111 | 8.497 | 3.995 | 2.013 | 0.894 | 0.104 | 0.119 |
| Orbital Brightness Modulation | 0.131 | 0.000 | 1.915 | 0.204 | 0.103 | 0.046 | 0.005 | 0.006 |
| Pulsar Timing | 0.001 | 0.693 | 3.011 | 0.159 | 0.080 | 0.036 | 0.004 | 0.005 |
| Pulsation Timing Variations | 0.270 | 0.198 | 0.097 | 0.045 | 0.023 | 0.010 | 0.001 | 0.001 |
| Radial Velocity | 4.191 | 11.204 | 0.029 | 0.079 | 0.025 | 3.370 | 0.610 | 0.697 |
| Transit | 0.069 | 1.661 | 0.558 | 0.235 | 0.210 | 0.353 | 0.264 | 0.302 |
| Transit Timing Variations | 6.269 | 8.004 | 0.031 | 2.872 | 1.416 | 0.122 | 0.014 | 0.016 |
corrplot(contribution, is.cor = FALSE)
ggstatsplot packageCode for this plot and some info on interpretation comes from here: https://yuzar-blog.netlify.app/posts/2021-12-14-how-to-conduct-chi-square-test-in-r/
ggbarstats(
data = exo,
x = n_planets,
y = discoverymethod,
label = "both",
palette = "Spectral"
)
mosaicplot functionCode for this plot comes from here: http://www.sthda.com/english/wiki/chi-square-test-of-independence-in-r
mosaicplot(t(sys_planets), shade = TRUE, las=2,
main = "")
vcd packageCode for this plot comes from here: https://statsandr.com/blog/chi-square-test-of-independence-in-r/
I like that it does the chi-square test for you but can’t get the plot to print nicely.
vcd::mosaic(~ n_planets + discoverymethod,
direction = c("v", "h"),
data = exo,
shade = TRUE,
spacing = spacing_equal(sp = unit(0, "lines")),
rot_labels=c(90,90,0,0)
)
Not a fan of the assoc plot from the same package either… Problably they would look better if I didn’t have so many small n’s in my contingency table.
assoc(~ n_planets + discoverymethod,
direction = c("v", "h"),
data = exo,
shade = TRUE,
rot_labels=c(90,90,0,0)
)
ggmosaic packageA few more plots to try from here: https://cran.r-project.org/web/packages/ggmosaic/vignettes/ggmosaic.html
The stacked Bar chart doesn’t really work for this data because the data is so heavily skewed by the transit discovery method. Shown together with the next oe though it’s not bad…
ggplot(data = exo) +
geom_mosaic(aes(x = product(n_planets, discoverymethod), fill = n_planets),
divider = c("vspine", "hbar")) +
labs(x="Number of Planets in the Star System",
y = "Discovery Method",
title = "Size of Planetary System found by Discovery Method") +
theme_minimal()
ggplot(data = exo) +
geom_mosaic(aes(x = product(n_planets), fill=n_planets), divider = "vspine") +
labs(title='f(n_planets | discoverymethod)') +
facet_grid(~discoverymethod) +
theme_minimal() +
theme(axis.text.x = element_blank(),
axis.ticks.x = element_blank())
Maybe with enough time I could figure out how to color this by residuals?
ggplot(data = exo) +
geom_mosaic(aes(x = product(n_planets, discoverymethod), fill = n_planets)) +
labs(title='f(n_planets | discoverymethod)') +
theme_minimal() +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))
https://jtr13.github.io/EDAVold/mosaic.html
vcd::doubledecker(discoverymethod ~ n_stars + n_planets, data=exo,
spacing = spacing_equal(sp = unit(0, "lines")))