Overview

This R Markdown document demonstrates how to access and analyze data from NASA’s Open APIs using R. We retrieve Near-Earth Object (NEO) data from the NASA API and perform a basic exploratory analysis.

Data Source: NASA Open APIs
https://api.nasa.gov/

1. Load Required Libraries

Each library covers one step of the workflow: httr sends the HTTP request, jsonlite parses the JSON response, dplyr reshapes and summarizes the data, ggplot2 draws the plots, and lubridate provides date helpers. The libraries you need will vary with the data you are working with and the operations you want to perform.

library(httr)
library(jsonlite)
library(dplyr)
library(ggplot2)
library(lubridate)

2. Set Up the API Request

An API key identifies us to NASA’s API and grants access to the data it publishes. The key "DEMO_KEY" is enough for a quick demonstration, but it is rate-limited; for regular use, request a personal key at https://api.nasa.gov/. The request also needs a time window, which we set with start_date and end_date. All three values are pasted into the request URL as query parameters.

api_key <- "DEMO_KEY"

start_date <- "2024-01-01"
end_date   <- "2024-01-07"

url <- paste0(
  "https://api.nasa.gov/neo/rest/v1/feed?",
  "start_date=", start_date,
  "&end_date=", end_date,
  "&api_key=", api_key
)
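The same URL can also be assembled with httr’s modify_url(), which takes the query parameters as a named list and handles URL encoding. The sketch below is one possible variant, not part of the original analysis. It reads the key from an environment variable (the name NASA_API_KEY is an assumption made for this example) and falls back to DEMO_KEY. It also checks that the window is at most 7 days, which is assumed here to be the limit the feed endpoint enforces per request.

```r
library(httr)

# Read the key from an environment variable; NASA_API_KEY is a hypothetical
# name chosen for this sketch. Fall back to the public demo key.
api_key <- Sys.getenv("NASA_API_KEY", unset = "DEMO_KEY")

start_date <- as.Date("2024-01-01")
end_date   <- as.Date("2024-01-07")

# Assumption: the feed endpoint accepts at most a 7-day window per request.
window_days <- as.numeric(difftime(end_date, start_date, units = "days"))
stopifnot(window_days >= 0, window_days <= 7)

# modify_url() appends the query list and URL-encodes the values.
feed_url <- modify_url(
  "https://api.nasa.gov/neo/rest/v1/feed",
  query = list(
    start_date = format(start_date),
    end_date   = format(end_date),
    api_key    = api_key
  )
)

print(feed_url)
```

Keeping the key out of the script makes it easier to share the document without exposing a personal key.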

3. Retrieve Data from the NASA API

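With the request URL built, we send it with GET() and read the response body as text. fromJSON() then parses that text into R objects. Setting flatten = TRUE collapses nested JSON objects into ordinary columns with dotted names (for example, estimated_diameter.meters.estimated_diameter_min), which makes the data much easier to select and transform in the next step.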

response <- GET(url)
content_data <- content(response, as = "text", encoding = "UTF-8")
neo_data <- fromJSON(content_data, flatten = TRUE)
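The request above assumes that the call succeeds. A slightly more defensive version checks the HTTP status before parsing. In the sketch below, the helper name fetch_json_text() and the sample JSON are hypothetical and only illustrate the idea. The sample also shows, without any network access, how flatten = TRUE turns nested fields into dotted column names.

```r
library(httr)
library(jsonlite)

# Hypothetical helper: fetch a URL and return the body as text,
# stopping with an informative error if the request failed.
fetch_json_text <- function(url) {
  response <- GET(url)
  if (http_error(response)) {
    stop("Request failed with HTTP status ", status_code(response))
  }
  content(response, as = "text", encoding = "UTF-8")
}

# Made-up sample in the same shape as the feed response (not real NASA data).
sample_json <- '{
  "near_earth_objects": {
    "2024-01-01": [
      {"name": "toy-A",
       "estimated_diameter": {"meters": {"estimated_diameter_min": 10.5}}},
      {"name": "toy-B",
       "estimated_diameter": {"meters": {"estimated_diameter_min": 42.0}}}
    ]
  }
}'

sample_data <- fromJSON(sample_json, flatten = TRUE)
sample_df   <- sample_data$near_earth_objects[["2024-01-01"]]

# Nested fields become single columns with dotted names.
print(names(sample_df))
```

With a working key, fromJSON(fetch_json_text(url), flatten = TRUE) could stand in for the three lines above.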

4. Parse and Tidy Data

The parsed response stores the asteroids in neo_data$near_earth_objects as a list with one data frame per date. bind_rows() stacks these into a single data frame, and colnames() lists the 18 columns that are available. We only need a few of them for our question: how hazardous are near-Earth objects in terms of velocity, miss distance, and size?

The close-approach details sit in a nested list-column, close_approach_data, so we expand it with unnest() from tidyr. The API returns velocity and miss distance as text, so we convert them with as.numeric(); otherwise they could not be summarized or plotted on a continuous axis. transmute() then keeps only the columns we are concerned with and gives them shorter names.

library(tidyr)

asteroids <- bind_rows(neo_data$near_earth_objects)

colnames(asteroids)
##  [1] "id"                                                  
##  [2] "neo_reference_id"                                    
##  [3] "name"                                                
##  [4] "nasa_jpl_url"                                        
##  [5] "absolute_magnitude_h"                                
##  [6] "is_potentially_hazardous_asteroid"                   
##  [7] "close_approach_data"                                 
##  [8] "is_sentry_object"                                    
##  [9] "links.self"                                          
## [10] "estimated_diameter.kilometers.estimated_diameter_min"
## [11] "estimated_diameter.kilometers.estimated_diameter_max"
## [12] "estimated_diameter.meters.estimated_diameter_min"    
## [13] "estimated_diameter.meters.estimated_diameter_max"    
## [14] "estimated_diameter.miles.estimated_diameter_min"     
## [15] "estimated_diameter.miles.estimated_diameter_max"     
## [16] "estimated_diameter.feet.estimated_diameter_min"      
## [17] "estimated_diameter.feet.estimated_diameter_max"      
## [18] "sentry_data"

clean_df <- asteroids %>%
  unnest(close_approach_data) %>%
  transmute(
    name,
    close_approach_date = as.Date(close_approach_date),
    velocity_kph = as.numeric(relative_velocity.kilometers_per_hour),
    miss_distance_km = as.numeric(miss_distance.kilometers),
    estimated_diameter_min = estimated_diameter.meters.estimated_diameter_min,
    estimated_diameter_max = estimated_diameter.meters.estimated_diameter_max,
    hazardous = is_potentially_hazardous_asteroid
  )
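To see what bind_rows(), unnest(), and as.numeric() each contribute, it can help to run the same pipeline on a tiny hand-made input. The toy list below is invented for illustration (it is not NASA data) and mimics only the parts of the structure used above: one data frame per date, a nested close_approach_data list-column, and velocities stored as text.

```r
library(dplyr)
library(tidyr)

# Hypothetical stand-in for neo_data$near_earth_objects: one tibble per date.
toy_feed <- list(
  "2024-01-01" = tibble(
    name = c("toy-A", "toy-B"),
    close_approach_data = list(
      tibble(close_approach_date = "2024-01-01",
             relative_velocity.kilometers_per_hour = "45000.5"),
      tibble(close_approach_date = "2024-01-01",
             relative_velocity.kilometers_per_hour = "61000.25")
    )
  ),
  "2024-01-02" = tibble(
    name = "toy-C",
    close_approach_data = list(
      tibble(close_approach_date = "2024-01-02",
             relative_velocity.kilometers_per_hour = "30000")
    )
  )
)

toy_clean <- bind_rows(toy_feed) %>%      # stack the per-date tibbles
  unnest(close_approach_data) %>%         # expand the nested list-column
  transmute(
    name,
    close_approach_date = as.Date(close_approach_date),
    velocity_kph = as.numeric(relative_velocity.kilometers_per_hour)
  )

print(toy_clean)
```

The result has one row per close approach, a Date column, and a numeric velocity column, which is the shape the later summaries and plots rely on.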

5. Summarize the Data

With the columns selected and the text fields converted to numeric, summary() gives a quick overview of each variable. Printing all 106 rows would not be informative, so we group by the hazardous flag and compare the two groups instead. The flag itself comes from the API (is_potentially_hazardous_asteroid); we do not derive it here. In this sample, the 8 hazardous objects have a higher average velocity (about 55,300 km/h versus 45,100 km/h) but a similar average miss distance (about 33.4 million km versus 34.4 million km). With only 8 hazardous objects, these averages describe this one week of data and should not be read as a rule for classifying objects.

summary(clean_df)
##      name           close_approach_date   velocity_kph    miss_distance_km  
##  Length:106         Min.   :2024-01-01   Min.   : 13413   Min.   :  242676  
##  Class :character   1st Qu.:2024-01-02   1st Qu.: 28202   1st Qu.:16086292  
##  Mode  :character   Median :2024-01-03   Median : 43828   Median :31380741  
##                     Mean   :2024-01-03   Mean   : 45917   Mean   :34348710  
##                     3rd Qu.:2024-01-06   3rd Qu.: 61269   3rd Qu.:54965219  
##                     Max.   :2024-01-07   Max.   :136268   Max.   :73609797  
##  estimated_diameter_min estimated_diameter_max hazardous      
##  Min.   :  1.756        Min.   :   3.927       Mode :logical  
##  1st Qu.: 17.091        1st Qu.:  38.218       FALSE:98       
##  Median : 38.508        Median :  86.108       TRUE :8        
##  Mean   : 80.872        Mean   : 180.835                      
##  3rd Qu.:108.218        3rd Qu.: 241.984                      
##  Max.   :620.233        Max.   :1386.883

clean_df %>%
  group_by(hazardous) %>%
  summarise(
    avg_velocity = mean(velocity_kph, na.rm = TRUE),
    avg_miss_distance = mean(miss_distance_km, na.rm = TRUE),
    count = n()
  )
## # A tibble: 2 × 4
##   hazardous avg_velocity avg_miss_distance count
##   <lgl>            <dbl>             <dbl> <int>
## 1 FALSE           45149.         34427519.    98
## 2 TRUE            55322.         33383307.     8
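The gap between the two groups is easier to judge as a relative difference. The short calculation below copies the group means from the table above and expresses the hazardous group relative to the non-hazardous group. The helper pct_diff() is a hypothetical name introduced only for this example.

```r
# Group means copied from the summary table above.
avg_velocity      <- c(not_hazardous = 45149,    hazardous = 55322)
avg_miss_distance <- c(not_hazardous = 34427519, hazardous = 33383307)

# Hypothetical helper: percent difference of the hazardous group
# relative to the non-hazardous group.
pct_diff <- function(x) {
  unname(100 * (x["hazardous"] - x["not_hazardous"]) / x["not_hazardous"])
}

velocity_diff_pct <- pct_diff(avg_velocity)
miss_diff_pct     <- pct_diff(avg_miss_distance)

print(round(c(velocity = velocity_diff_pct, miss_distance = miss_diff_pct), 1))
```

For this sample, the hazardous group is roughly 22.5% faster on average, while its average miss distance is only about 3% smaller.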

6a. Visualize Approach Velocity

We visualize the distribution of approach velocities with ggplot() and geom_histogram(). Velocity is mapped to the x-axis and split into 30 bins, so the height of each bar is the number of objects in that speed range. labs() sets the title and axis labels, and theme_minimal() keeps the plot uncluttered.

ggplot(clean_df, aes(x = velocity_kph)) +
  geom_histogram(bins = 30) +
  labs(
    title = "Distribution of Asteroid Approach Velocity",
    x = "Velocity (km/h)",
    y = "Count"
  ) +
  theme_minimal()
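The choice of bins = 30 is arbitrary. A possible alternative is to fix the bin width instead, so that the bar edges fall on round numbers. The sketch below uses simulated velocities (not NASA data) so that it runs on its own; the 10,000 km/h width is an assumption chosen for readability, not a recommendation from the original analysis.

```r
library(ggplot2)

# Simulated velocities for illustration only (not NASA data); the range
# roughly matches the minimum and maximum in the summary above.
set.seed(1)
toy_velocity <- data.frame(
  velocity_kph = runif(106, min = 13000, max = 137000)
)

p_hist <- ggplot(toy_velocity, aes(x = velocity_kph)) +
  # Bins are 10,000 km/h wide and aligned to 0, so edges fall on round numbers.
  geom_histogram(binwidth = 10000, boundary = 0) +
  labs(
    title = "Distribution of Asteroid Approach Velocity (simulated)",
    x = "Velocity (km/h)",
    y = "Count"
  ) +
  theme_minimal()

print(p_hist)
```

Swapping toy_velocity for clean_df would apply the same binning to the real data.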

6b. Visualize Approach Velocity and Miss Distance

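To look at two variables together, we use a scatter plot (geom_point()) and map the hazardous flag to color. With ggplot2’s default palette, objects flagged TRUE appear in blue (a blue-green shade) and objects flagged FALSE in red. Setting alpha = 0.7 makes overlapping points easier to see.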

ggplot(clean_df, aes(x = velocity_kph, y = miss_distance_km, color = hazardous)) +
  geom_point(alpha = 0.7) +
  labs(
    title = "Miss Distance vs Velocity",
    x = "Velocity (km/h)",
    y = "Miss Distance (km)",
    color = "Potentially Hazardous"
  ) +
  theme_minimal()
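The colors in the plot above come from ggplot2’s default palette, so any statement about which color means "hazardous" depends on that default. One way to make the mapping explicit is scale_color_manual(). The sketch below uses a few made-up points (not NASA data) so that it runs on its own, and the specific colors are an arbitrary choice.

```r
library(ggplot2)

# Made-up points for illustration only (not NASA data).
toy_points <- data.frame(
  velocity_kph     = c(20000, 45000, 60000, 90000),
  miss_distance_km = c(5e6, 3e7, 5.5e7, 1.2e7),
  hazardous        = c(FALSE, FALSE, TRUE, TRUE)
)

p_scatter <- ggplot(
  toy_points,
  aes(x = velocity_kph, y = miss_distance_km, color = hazardous)
) +
  geom_point(alpha = 0.7) +
  # Named values pin each level to a color, independent of the default palette.
  scale_color_manual(values = c("FALSE" = "grey60", "TRUE" = "blue")) +
  labs(
    title = "Miss Distance vs Velocity (made-up points)",
    x = "Velocity (km/h)",
    y = "Miss Distance (km)",
    color = "Potentially Hazardous"
  ) +
  theme_minimal()

print(p_scatter)
```

Adding the same scale_color_manual() line to the plot of clean_df would fix hazardous objects to blue regardless of theme or palette settings.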