Week Four

Objectives

The objective is to draw random sub-samples of the data set and observe how the distribution of orbital periods varies between them. Comparing the sub-samples helps reveal differences, identify potential anomalies, and show how sample size affects the reliability of summary statistics.

Load Libraries

library(dplyr)
## Warning: package 'dplyr' was built under R version 4.5.2
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(tibble)
## Warning: package 'tibble' was built under R version 4.5.2
library(readr)
## Warning: package 'readr' was built under R version 4.5.2

NASA Data Set

nasa_data <- read_delim("C:/Users/imaya/Downloads/cleaned_5250.csv",delim = ",")
## Rows: 5250 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (5): name, planet_type, mass_wrt, radius_wrt, detection_method
## dbl (8): distance, stellar_magnitude, discovery_year, mass_multiplier, radiu...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Sampling Data and Statistics

# change this number, and consider how it affects the sub-sample analysis
sample_frac <- 0.25

# Draw n_samples random sub-samples (with replacement), each containing
# sample_frac of the rows in `data`, and summarize orbital_period per sub-sample.
samples_sizes <- function(data, sample_frac = 0.25, n_samples = 3) {

  df_samples <- tibble()

  for (sample_i in 1:n_samples) {
    # sample from the `data` argument, not the global nasa_data;
    # floor() keeps the requested size a whole number of rows
    df_i <- data %>%
      sample_n(size = floor(sample_frac * nrow(data)), replace = TRUE) %>%
      mutate(sample_num = sample_i)

    df_samples <- bind_rows(df_samples, df_i)
  }

  # one row of summary statistics per sub-sample
  df_summary <- df_samples %>%
    group_by(sample_num) %>%
    summarize(
      sd_orbital_period = sd(orbital_period, na.rm = TRUE),
      min_orbital_period = min(orbital_period, na.rm = TRUE),
      Q1_orbital_period = quantile(orbital_period, 0.25, na.rm = TRUE),
      mean_orbital_period = mean(orbital_period, na.rm = TRUE),
      Q3_orbital_period = quantile(orbital_period, 0.75, na.rm = TRUE),
      max_orbital_period = max(orbital_period, na.rm = TRUE)
    )

  return(df_summary)
}
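
Because the function draws rows at random, its output changes on every run. The following is a minimal sketch of one way to make a draw repeatable, not part of the analysis above. It assumes a small made-up tibble (`toy_data` is hypothetical, not the NASA data) and uses `set.seed()` together with `slice_sample()`, which dplyr documents as the successor to `sample_n()`.

```r
library(dplyr)
library(tibble)

# Hypothetical stand-in for the real data set (values are made up).
toy_data <- tibble(orbital_period = c(0.01, 0.02, 0.05, 0.3, 1.2, 4.5, 11.9, 250))

# Setting the same seed before each draw makes the random sub-sample repeatable.
set.seed(42)
draw_a <- slice_sample(toy_data, prop = 0.5, replace = TRUE)

set.seed(42)
draw_b <- slice_sample(toy_data, prop = 0.5, replace = TRUE)

identical(draw_a, draw_b)  # TRUE: same seed, same rows
nrow(draw_a)               # 4 rows, i.e. 50% of 8
```

Calling `set.seed()` once at the top of the report would, under the same assumption, make every table in it reproducible from run to run.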

Sample Size

summary_25 <- samples_sizes(nasa_data, sample_frac = 0.25, n_samples = 3)
summary_50 <- samples_sizes(nasa_data, sample_frac = 0.5, n_samples = 5)


summary_25
## # A tibble: 3 × 7
##   sample_num sd_orbital_period min_orbital_period Q1_orbital_period
##        <int>             <dbl>              <dbl>             <dbl>
## 1          1            30415.           0.000821            0.0126
## 2          2            30465.           0.000821            0.0126
## 3          3            30809.           0.00110             0.0126
## # ℹ 3 more variables: mean_orbital_period <dbl>, Q3_orbital_period <dbl>,
## #   max_orbital_period <dbl>
summary_50
## # A tibble: 5 × 7
##   sample_num sd_orbital_period min_orbital_period Q1_orbital_period
##        <int>             <dbl>              <dbl>             <dbl>
## 1          1            10801.           0.000274            0.0120
## 2          2             7180.           0.000274            0.0129
## 3          3             6506.           0.000274            0.0131
## 4          4            23155.           0.000548            0.0131
## 5          5             7919.           0.000821            0.0129
## # ℹ 3 more variables: mean_orbital_period <dbl>, Q3_orbital_period <dbl>,
## #   max_orbital_period <dbl>

Data Observations

I examined several random sub-samples of the data set. The first group consisted of three sub-samples, each 25% the size of the data set; the second group consisted of five sub-samples, each 50% the size of the data set. Every sub-sample was drawn independently and with replacement. The goal was to compare the distribution of orbital_period across sub-samples.

In the first group, two sub-samples sometimes had very similar values while the third was quite different. A similar pattern appeared in the second group: two or three of the five sub-samples had values in a similar range, while the others were farther apart. This variability comes from the random sampling itself: sub-samples that happen to contain similar data points, in particular the same extreme values, produce summary statistics that are close to each other.
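
One way to see why sub-samples disagree is to ask how likely a given sub-sample is to contain one particular row, such as the single longest orbital period. When n rows are drawn with replacement from N rows, that probability is 1 - (1 - 1/N)^n. The sketch below assumes the 5250 rows reported when the file was read; `prob_included` is a hypothetical helper written for this illustration.

```r
# Probability that one specific row appears at least once when
# n rows are drawn with replacement from N rows.
prob_included <- function(N, n) {
  1 - (1 - 1 / N)^n
}

N <- 5250                                   # rows in the data set
p_25 <- prob_included(N, floor(0.25 * N))   # 25% sub-sample
p_50 <- prob_included(N, floor(0.50 * N))   # 50% sub-sample

round(c(p_25 = p_25, p_50 = p_50), 2)       # roughly 0.22 and 0.39
```

Under these assumptions, a 25% sub-sample misses any given extreme row most of the time, so statistics that depend on the extremes (the maximum, and through it the mean and standard deviation) can differ sharply from one sub-sample to the next.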

One clear anomaly in the first group is the mean orbital period of the second sub-sample, which is far lower than the others: 12.88 days, compared with about 424 and 1032 days for the other two. This occurs because the maximum orbital period in that sub-sample is only about 13,538 days, whereas the other sub-samples contain values above 150,000 days, and the mean is highly sensitive to such extreme values. Because no random seed is set, the exact figures change each time the code is run; the numbers quoted here come from one such run.
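
The effect of a single extreme value on the mean can be illustrated with a toy example. The numbers below are made up for illustration and are not taken from the NASA data; the median is shown alongside only as a contrast.

```r
# Made-up orbital periods: mostly short, plus one extreme value.
without_extreme <- c(2, 3, 4, 5, 6)
with_extreme <- c(without_extreme, 150000)

mean(without_extreme)    # 4
mean(with_extreme)       # about 25003.33
median(without_extreme)  # 4
median(with_extreme)     # 4.5
```

One extreme value moves the mean by several orders of magnitude while the median barely changes, which is consistent with a sub-sample that misses the largest orbital periods having a much lower mean.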

After increasing the sample size to 50% of the data and the number of sub-samples to five, the summary statistics were more consistent across sub-samples and extreme differences were smaller. For instance, the maximum and mean orbital periods were closer together, which suggests that larger samples better reflect the population distribution and are less prone to the anomalies that small samples produce.
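
A small simulation can check this intuition without relying on a single run. The sketch below is an illustration only: `population` is a hypothetical right-skewed data set generated with `rlnorm()`, not the NASA data, and `spread_of_means` is a helper name introduced here. It measures how much the sub-sample mean varies across many repeated draws at each sample fraction.

```r
set.seed(1)

# Hypothetical right-skewed "population" standing in for orbital_period.
population <- rlnorm(5000, meanlog = 0, sdlog = 1.5)

# Standard deviation of the sub-sample mean across many repeated draws.
spread_of_means <- function(x, frac, n_reps = 1000) {
  size <- floor(frac * length(x))
  means <- replicate(n_reps, mean(sample(x, size = size, replace = TRUE)))
  sd(means)
}

spread_25 <- spread_of_means(population, 0.25)
spread_50 <- spread_of_means(population, 0.50)

c(spread_25 = spread_25, spread_50 = spread_50)
spread_25 / spread_50  # theory predicts a ratio near sqrt(2)
```

When sampling with replacement, the standard deviation of the sample mean shrinks in proportion to one over the square root of the sample size, so doubling the fraction from 25% to 50% should reduce the spread by a factor of about 1.4 rather than eliminate it.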