knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)

setwd("~/OneDrive/CUNY GC MS DA+DV/DATA 71000 - Data Analysis Methods/Final Project")

# install.packages("kableExtra")
# install.packages("gplots")
# install.packages('corrplot')
# install.packages("ggstatsplot")
# install.packages("vcd")
# install.packages("ggmosaic")
library(tidyverse)
library(psych)
library(gplots)
library(corrplot)
library(kableExtra)
library(ggstatsplot)
library(vcd)
library(ggmosaic)

options(scipen = 999, digits = 4)

Intro

I started this a few days ago and just finished it up even though after today’s class it seems clear that I shouldn’t be using the chi-square test with this data because there are so many small n’s and zeros in my contingency tables. I should probably be using the Fisher’s exact test because my expected counts are less than 5 in many cells.1 Even though that reference says it is only for 2x2 contingency tables, that is just becuase the calculations would be impossible to do by hand for larger tables.2 This is just meant to be a comparison of the different ways to visualize the chi-square test. Some of the visualizations are pretty bad! But they are there for reference so I know what to use and what not to in the future.

Get and Clean Exoplanet Data

Download data from https://exoplanetarchive.ipac.caltech.edu/cgi-bin/TblView/nph-tblView?app=ExoTbls&config=PSCompPars as csv with all columns and all rows.

# Exoplanet Data
col_names <- read_csv("PSCompPars_2023.03.14_17.23.51.csv", skip = 2, n_max = 312)
exo <- read_csv("PSCompPars_2023.03.14_17.23.51.csv", skip = 316)

col_names <- col_names %>%
  mutate(var = str_split_i(str_split_i(`#`,":",1), "# COLUMN ", 2),
         definition = trimws(str_split_i(`#`,":",2))) %>%
  select(var, definition)

exo <- exo %>%
  mutate(across(c(discoverymethod, disc_locale, disc_facility, disc_telescope,
                  disc_instrument), factor),
         n_stars = ordered(sy_snum,
                           levels = sort(unique(exo$sy_snum)),
                           labels = paste(sort(unique(exo$sy_snum)), "star system")),
         n_planets = ordered(sy_pnum,
                           levels = sort(unique(exo$sy_pnum)),
                           labels = paste(sort(unique(exo$sy_pnum)), "planet system")))

Chi-Square tests and plots

Most of the code for Chi-Square testing and plotting comes from here: http://www.sthda.com/english/wiki/chi-square-test-of-independence-in-r

Size of Planetary System found by Discovery Method

Create contingency table

sys_planets <- table(exo$discoverymethod, exo$n_planets) 
sys_planets %>% as.data.frame.matrix() %>% kbl() %>% kable_minimal()
1 planet system 2 planet system 3 planet system 4 planet system 5 planet system 6 planet system 7 planet system 8 planet system
Astrometry 2 0 0 0 0 0 0 0
Disk Kinematics 1 0 0 0 0 0 0 0
Eclipse Timing Variations 6 8 3 0 0 0 0 0
Imaging 50 9 0 4 0 0 0 0
Microlensing 162 14 0 0 0 0 0 0
Orbital Brightness Modulation 4 2 3 0 0 0 0 0
Pulsar Timing 4 0 3 0 0 0 0 0
Pulsation Timing Variations 2 0 0 0 0 0 0 0
Radial Velocity 524 302 108 49 25 21 0 0
Transit 2329 818 450 211 108 39 7 8
Transit Timing Variations 0 15 3 4 2 0 0 0

Visualize the contingency table

balloonplot(t(sys_planets), main ="", xlab ="", ylab="", text.size=.75,
            label.size=1, label.color='cornflowerblue',
            label = TRUE, show.margins = TRUE)

Perform Chi-Square test

chi <- chisq.test(sys_planets)
chi
## 
##  Pearson's Chi-squared test
## 
## data:  sys_planets
## X-squared = 223, df = 70, p-value <0.0000000000000002

Observed counts

chi$observed %>% as.data.frame.matrix() %>% kbl() %>% kable_minimal()
1 planet system 2 planet system 3 planet system 4 planet system 5 planet system 6 planet system 7 planet system 8 planet system
Astrometry 2 0 0 0 0 0 0 0
Disk Kinematics 1 0 0 0 0 0 0 0
Eclipse Timing Variations 6 8 3 0 0 0 0 0
Imaging 50 9 0 4 0 0 0 0
Microlensing 162 14 0 0 0 0 0 0
Orbital Brightness Modulation 4 2 3 0 0 0 0 0
Pulsar Timing 4 0 3 0 0 0 0 0
Pulsation Timing Variations 2 0 0 0 0 0 0 0
Radial Velocity 524 302 108 49 25 21 0 0
Transit 2329 818 450 211 108 39 7 8
Transit Timing Variations 0 15 3 4 2 0 0 0

Expected counts

round(chi$expected,2) %>% as.data.frame.matrix() %>% kbl() %>% kable_minimal()
1 planet system 2 planet system 3 planet system 4 planet system 5 planet system 6 planet system 7 planet system 8 planet system
Astrometry 1.16 0.44 0.22 0.10 0.05 0.02 0.00 0.00
Disk Kinematics 0.58 0.22 0.11 0.05 0.03 0.01 0.00 0.00
Eclipse Timing Variations 9.89 3.75 1.83 0.86 0.43 0.19 0.02 0.03
Imaging 36.66 13.88 6.78 3.19 1.60 0.71 0.08 0.10
Microlensing 102.41 38.79 18.93 8.90 4.48 1.99 0.23 0.27
Orbital Brightness Modulation 5.24 1.98 0.97 0.46 0.23 0.10 0.01 0.01
Pulsar Timing 4.07 1.54 0.75 0.35 0.18 0.08 0.01 0.01
Pulsation Timing Variations 1.16 0.44 0.22 0.10 0.05 0.02 0.00 0.00
Radial Velocity 598.76 226.77 110.67 52.03 26.21 11.65 1.36 1.55
Transit 2310.09 874.90 426.96 200.75 101.12 44.94 5.24 5.99
Transit Timing Variations 13.97 5.29 2.58 1.21 0.61 0.27 0.03 0.04

Pearson Residuals

round(chi$residuals, 3) %>% as.data.frame.matrix() %>% kbl() %>% kable_minimal()
1 planet system 2 planet system 3 planet system 4 planet system 5 planet system 6 planet system 7 planet system 8 planet system
Astrometry 0.775 -0.664 -0.464 -0.318 -0.226 -0.150 -0.051 -0.055
Disk Kinematics 0.548 -0.469 -0.328 -0.225 -0.160 -0.106 -0.036 -0.039
Eclipse Timing Variations -1.237 2.198 0.867 -0.927 -0.658 -0.439 -0.150 -0.160
Imaging 2.203 -1.311 -2.603 0.456 -1.267 -0.845 -0.288 -0.308
Microlensing 5.888 -3.980 -4.351 -2.983 -2.117 -1.412 -0.482 -0.515
Orbital Brightness Modulation -0.541 0.012 2.065 -0.675 -0.479 -0.319 -0.109 -0.117
Pulsar Timing -0.036 -1.242 2.590 -0.595 -0.422 -0.282 -0.096 -0.103
Pulsation Timing Variations 0.775 -0.664 -0.464 -0.318 -0.226 -0.150 -0.051 -0.055
Radial Velocity -3.055 4.996 -0.253 -0.420 -0.236 2.740 -1.166 -1.246
Transit 0.393 -1.924 1.115 0.724 0.684 -0.887 0.767 0.820
Transit Timing Variations -3.737 4.223 0.261 2.529 1.776 -0.521 -0.178 -0.190
corrplot(chi$residuals, is.cor = FALSE)

Contribution in percentage (%)

contribution <- 100*chi$residuals^2/chi$statistic
round(contribution, 3) %>% as.data.frame.matrix() %>% kbl() %>% kable_minimal()
1 planet system 2 planet system 3 planet system 4 planet system 5 planet system 6 planet system 7 planet system 8 planet system
Astrometry 0.270 0.198 0.097 0.045 0.023 0.010 0.001 0.001
Disk Kinematics 0.135 0.099 0.048 0.023 0.011 0.005 0.001 0.001
Eclipse Timing Variations 0.687 2.168 0.337 0.386 0.194 0.086 0.010 0.012
Imaging 2.180 0.771 3.042 0.093 0.720 0.320 0.037 0.043
Microlensing 15.565 7.111 8.497 3.995 2.013 0.894 0.104 0.119
Orbital Brightness Modulation 0.131 0.000 1.915 0.204 0.103 0.046 0.005 0.006
Pulsar Timing 0.001 0.693 3.011 0.159 0.080 0.036 0.004 0.005
Pulsation Timing Variations 0.270 0.198 0.097 0.045 0.023 0.010 0.001 0.001
Radial Velocity 4.191 11.204 0.029 0.079 0.025 3.370 0.610 0.697
Transit 0.069 1.661 0.558 0.235 0.210 0.353 0.264 0.302
Transit Timing Variations 6.269 8.004 0.031 2.872 1.416 0.122 0.014 0.016

Visualize the contribution

corrplot(contribution, is.cor = FALSE)

ggstatsplot package

Code for this plot and some info on interpretation comes from here: https://yuzar-blog.netlify.app/posts/2021-12-14-how-to-conduct-chi-square-test-in-r/

ggbarstats(
  data  = exo, 
  x     = n_planets, 
  y     = discoverymethod, 
  label = "both", 
  palette = "Spectral"
)

Base R mosaicplot function

Code for this plot comes from here: http://www.sthda.com/english/wiki/chi-square-test-of-independence-in-r

mosaicplot(t(sys_planets), shade = TRUE, las=2,
           main = "")

vcd package

Code for this plot comes from here: https://statsandr.com/blog/chi-square-test-of-independence-in-r/

I like that it does the chi-square test for you but can’t get the plot to print nicely.

vcd::mosaic(~ n_planets + discoverymethod,
  direction = c("v", "h"),
  data = exo,
  shade = TRUE,
  spacing = spacing_equal(sp = unit(0, "lines")),
  rot_labels=c(90,90,0,0)
)

Not a fan of the assoc plot from the same package either… Problably they would look better if I didn’t have so many small n’s in my contingency table.

assoc(~ n_planets + discoverymethod,
  direction = c("v", "h"),
  data = exo,
  shade = TRUE,
  rot_labels=c(90,90,0,0)
)

ggmosaic package

A few more plots to try from here: https://cran.r-project.org/web/packages/ggmosaic/vignettes/ggmosaic.html

The stacked Bar chart doesn’t really work for this data because the data is so heavily skewed by the transit discovery method. Shown together with the next oe though it’s not bad…

ggplot(data = exo) + 
  geom_mosaic(aes(x = product(n_planets, discoverymethod), fill = n_planets),
              divider = c("vspine", "hbar")) +   
  labs(x="Number of Planets in the Star System", 
       y = "Discovery Method", 
       title = "Size of Planetary System found by Discovery Method") +
  theme_minimal()

ggplot(data = exo) +
  geom_mosaic(aes(x = product(n_planets), fill=n_planets), divider = "vspine") +
  labs(title='f(n_planets | discoverymethod)') + 
  facet_grid(~discoverymethod) +
  theme_minimal() +
  theme(axis.text.x = element_blank(),
        axis.ticks.x = element_blank())

This is maybe the best looking mosaic plot but I think the Base R plot is more useful because it includes the residuals.

Maybe with enough time I could figure out how to color this by residuals?

ggplot(data = exo) +
  geom_mosaic(aes(x = product(n_planets, discoverymethod), fill = n_planets)) + 
  labs(title='f(n_planets | discoverymethod)') +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))

This one would definitely need some work. Not sure it’s worth figuring it out anyway…

https://jtr13.github.io/EDAVold/mosaic.html

vcd::doubledecker(discoverymethod ~ n_stars + n_planets, data=exo,
  spacing = spacing_equal(sp = unit(0, "lines")))