Introduction

For this project, we will explore the Meteorite Landings Dataset, which I accessed from NASA’s Open Data Portal. The data comes from The Meteoritical Society, containing information on all of the known meteorite landings. Over 45,000 observations were recorded with 10 variables recorded from 860 until 2013, but unfortunately, I was unable to find a ReadMe file describing how the information was collected. The following variables were recorded:

name - name of meteorite (typically a location, often modified with a number, year, composition, etc.)

id - a unique identifier for the meteorite

nametype - ‘valid’: a typical meteorite OR ‘relict’: a meteorite has been highly altered by weathering on Earth

recclass - recommended classification of meteorite (based on physical, chemical, etc.)

mass(g) - mass in grams of meteorite

fall - ‘fell’: witnessed during their fall OR ‘found’: discovered many years later

year - year meteorite fell or was found

reclat - latitude of the meteorite landing

reclong - longitude of the meteorite landing

GeoLocation - coordinate of meteorite landing (parentheses encloses combining reclat and reclong)

Of the four categorical variables, nametype, recclass, fall and year, we will explore fall and year. The reason for this is that the majority of meteorites are valid and there are over 400 recommended classifications. We will also explore the following quantitative variables: mass(g), reclat and reclong. We will clean the dataset by making all variable names lowercase, replacing spaces with underscores, and removing observations with NAs. The two questions we will ponder throughout are, “is there a relationship between a meteorite’s mass and it’s GeoLocation?” and “is there a relationship between the number of meteorite landings and the average mass of meteorites each year?”. We will also explore the years with the highest meteorite activity. To explore the relationship between mass and GeoLocation, we will determine the correlation coefficient and create a multiple regression model. To determine the relationship between number of landings in a year and the average mass, we will group the data by year, count the number of meteorites each year and then calculate the average mass of meteorites each year. This information will then be explored through a variety of models and visualizations.

I chose this topic because I am fascinated with all thing space-related and because my ultimate dream is to work for NASA. I dream to eventually be given the chance to travel to space, or at least work supporting the safe journey for a group of astronauts.

Bibliography:

https://data.nasa.gov/Space-Science/Meteorite-Landings/gh4g-9sfh

Tripathi, Kamlakant & Navale, Pratik. (2018). Meteorite Landings. 10.13140/RG.2.2.27368.57602.

Working libraries

#install.packages('tigris')
myPackages <- c("ggplot2", "tidyverse", "tmap", "tmaptools", "leaflet", "leaflet.extras", "dplyr", "rio", "sp", "RColorBrewer", "raster", "rgdal", "sf", "tigris", "highcharter", "plotly")
invisible(lapply(myPackages, library, character.only = TRUE))
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ tibble  3.1.8      ✔ dplyr   1.0.10
## ✔ tidyr   1.2.1      ✔ stringr 1.5.0 
## ✔ readr   2.1.3      ✔ forcats 0.5.2 
## ✔ purrr   0.3.5      
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## 
## Attaching package: 'raster'
## 
## 
## The following object is masked from 'package:dplyr':
## 
##     select
## 
## 
## Please note that rgdal will be retired during 2023,
## plan transition to sf/stars/terra functions using GDAL and PROJ
## at your earliest convenience.
## See https://r-spatial.org/r/2022/04/12/evolution.html and https://github.com/r-spatial/evolution
## rgdal: version: 1.6-2, (SVN revision 1183)
## Geospatial Data Abstraction Library extensions to R successfully loaded
## Loaded GDAL runtime: GDAL 3.4.2, released 2022/03/08
## Path to GDAL shared files: /Library/Frameworks/R.framework/Versions/4.2/Resources/library/rgdal/gdal
## GDAL binary built with GEOS: FALSE 
## Loaded PROJ runtime: Rel. 8.2.1, January 1st, 2022, [PJ_VERSION: 821]
## Path to PROJ shared files: /Library/Frameworks/R.framework/Versions/4.2/Resources/library/rgdal/proj
## PROJ CDN enabled: FALSE
## Linking to sp version:1.5-1
## To mute warnings of possible GDAL/OSR exportToProj4() degradation,
## use options("rgdal_show_exportToProj4_warnings"="none") before loading sp or rgdal.
## 
## Linking to GEOS 3.10.2, GDAL 3.4.2, PROJ 8.2.1; sf_use_s2() is TRUE
## 
## To enable caching of data, set `options(tigris_use_cache = TRUE)`
## in your R script or .Rprofile.
## 
## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo 
## 
## Highcharts (www.highcharts.com) is a Highsoft software product which is
## 
## not free for commercial and Governmental use
## 
## 
## Attaching package: 'plotly'
## 
## 
## The following object is masked from 'package:raster':
## 
##     select
## 
## 
## The following object is masked from 'package:rio':
## 
##     export
## 
## 
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## 
## 
## The following object is masked from 'package:stats':
## 
##     filter
## 
## 
## The following object is masked from 'package:graphics':
## 
##     layout

Read in the Meteorite Landings Dataset

setwd("/Users/KathyOchoa/Documents/DATA 110/Final Project")
meteoriteData <- read_csv("Meteorite_Landings.csv")
## Rows: 45716 Columns: 12
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (5): name, nametype, recclass, fall, GeoLocation
## dbl (7): id, mass (g), year, reclat, reclong, States, Counties
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(meteoriteData)
## # A tibble: 6 × 12
##   name      id namet…¹ reccl…² mass …³ fall   year reclat reclong GeoLo…⁴ States
##   <chr>  <dbl> <chr>   <chr>     <dbl> <chr> <dbl>  <dbl>   <dbl> <chr>    <dbl>
## 1 Aachen     1 Valid   L5           21 Fell   1880   50.8    6.08 (50.77…     NA
## 2 Aarhus     2 Valid   H6          720 Fell   1951   56.2   10.2  (56.18…     NA
## 3 Abee       6 Valid   EH4      107000 Fell   1952   54.2 -113    (54.21…     NA
## 4 Acapu…    10 Valid   Acapul…    1914 Fell   1976   16.9  -99.9  (16.88…     NA
## 5 Achir…   370 Valid   L6          780 Fell   1902  -33.2  -65.0  (-33.1…     NA
## 6 Adhi …   379 Valid   EH4        4239 Fell   1919   32.1   71.8  (32.1,…     NA
## # … with 1 more variable: Counties <dbl>, and abbreviated variable names
## #   ¹​nametype, ²​recclass, ³​`mass (g)`, ⁴​GeoLocation

Clean the data by making all variables lowercase and by replacing spaces with underscores.

names(meteoriteData) <- tolower(names(meteoriteData)) # make names lowercase
names(meteoriteData) <- gsub(" ","_",names(meteoriteData)) # replace spaces with underscores

Remove observations with NA values from reclat, reclong, year, and mass_(g) columns.

removeNAs <- meteoriteData %>%
  filter(!is.na(reclat) & !is.na(reclong) & !is.na(year) & !is.na(`mass_(g)`))

Get geographical data files to visualize the world and do a quick plot of the world map

setwd("/Users/KathyOchoa/Documents/DATA 110/world-administrative-boundaries")
worldshapefile <- "world-administrative-boundaries.shp"
worldgeo <- shapefile(worldshapefile)

qtm(worldgeo) # quick thematic plot for reference

Explore Categorical Variable: ‘fall’

Create a scatter plot of the fallen versus found meteorite landings around the world using latitude and longitude

plot1 <- removeNAs %>%
  ggplot(aes(x=reclong, y=reclat)) + # x-axis: longitude and y-axis: latitude
  geom_point(alpha=0.5, aes(color = fall)) + #scatter plot differentiating fallen vs found meteorites
  scale_color_manual(name = "Fall/Found", values = c("Fell" = 'black',
                               "Found" = 'red')) +
  ggtitle("Fallen vs Found Meteorite Landings Around the World") +
  xlab("Longitude") +
  ylab("Latitude") +
  theme_minimal()

plot1

Correlation

Determine if there is a relationship between mass of meteorite and location.

cor(removeNAs$`mass_(g)`, removeNAs$reclat) #correlation of mass with latitude
## [1] 0.02923452
cor(removeNAs$`mass_(g)`, removeNAs$reclong) #correlation of mass with longitude 
## [1] -0.02185385

Since both correlation coefficient values are close to zero, there is no correlation between the mass of a meteorite and that meteorite’s GeoLocation.

Multiple Regression Model

fit1 <- lm(`mass_(g)` ~ reclat + reclong + year + fall + nametype, data = removeNAs)
summary(fit1)
## 
## Call:
## lm(formula = `mass_(g)` ~ reclat + reclong + year + fall + nametype, 
##     data = removeNAs)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -4212153   -29166    -9334    27507 59692103 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   7572095.02  312010.13  24.269  < 2e-16 ***
## reclat            519.20      89.07   5.829 5.62e-09 ***
## reclong            63.75      49.42   1.290    0.197    
## year            -3943.29     145.78 -27.049  < 2e-16 ***
## fallFound      310029.21   23246.20  13.337  < 2e-16 ***
## nametypeValid    5912.70  136028.88   0.043    0.965    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 622500 on 38109 degrees of freedom
## Multiple R-squared:  0.01971,    Adjusted R-squared:  0.01958 
## F-statistic: 153.3 on 5 and 38109 DF,  p-value: < 2.2e-16

Since reclong and nametypeValid do not have asterisks to the right, they do not contribute to the model. Let’s remove these variables.

fit2 <- lm(`mass_(g)` ~ reclat + year + fall, data = removeNAs)
summary(fit2)
## 
## Call:
## lm(formula = `mass_(g)` ~ reclat + year + fall, data = removeNAs)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -4187257   -26939    -7573    28718 59691979 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 7549391.61  278499.89  27.107  < 2e-16 ***
## reclat          450.66      71.46   6.306 2.89e-10 ***
## year          -3926.58     145.19 -27.044  < 2e-16 ***
## fallFound    306493.71   23083.71  13.277  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 622500 on 38111 degrees of freedom
## Multiple R-squared:  0.01967,    Adjusted R-squared:  0.01959 
## F-statistic: 254.9 on 3 and 38111 DF,  p-value: < 2.2e-16

Since the adjusted R-squared value did not improve, this is not a good model.

Explore Quantitative Variable: ‘mass(g)’

Explore the relationship between the number of meteorite landings and the average mass of meteorite landings each year

Create by_year, where you take the removeNAs data and then group by year, and then summarize to count the number of landings and the average mass of meteorite landings.

by_year <- removeNAs %>% #use dataset without NAs
  group_by(year) %>% # combine observations for each year
  summarize(count = n(), # count the number of landings
            avgMass = mean(`mass_(g)`)) # calculate the average mass of meteorites each year

Compare the relationship between number of landings and average mass with a linear regression

countVsAvgMass <-ggplot(by_year, aes(x=count, y = avgMass)) +
  ggtitle("Number of Meteorite Landings vs Average Mass") +
  xlab("Number of Landings") +
  ylab ("Average Mass") +
  theme_minimal(base_size = 12) +
  geom_point() +
  geom_smooth(method = 'lm', formula = y~x, se = FALSE, linetype= "solid", size = 0.5, color = "red") # linear regression line without confidence interval
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
countVsAvgMass

Correlation

Determine if there is any correlation between count (the number of landings each year) and avgMass (the average mass, in grams, of meteorites each year)

cor(by_year$count, by_year$avgMass) # determine the correlation coefficient 
## [1] -0.04858525

Since the correlation coefficient is close to zero, there is no correlation between the number of meteorite landings and the average mass each year.

Model Summary

fit3 <- lm(count ~ avgMass, data = by_year)
summary(fit3)
## 
## Call:
## lm(formula = count ~ avgMass, data = by_year)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -152.5 -149.5 -141.2 -126.3 2891.5 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.535e+02  2.676e+01   5.738 2.76e-08 ***
## avgMass     -6.155e-06  7.987e-06  -0.771    0.442    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 421.5 on 251 degrees of freedom
## Multiple R-squared:  0.002361,   Adjusted R-squared:  -0.001614 
## F-statistic: 0.5939 on 1 and 251 DF,  p-value: 0.4416

Since the the p-value for avgMass is relatively high, it is likely not contributing to the model. These results are confirmed by the low adjusted R-squared value, indicating that this is not a good model.

Explore the top 20 years with the highest number of landings and highest average mass

Let’s take a look at the 10 years with the highest number of meteorite landings and 10 years with the highest average mass

topCount <- by_year %>%
  arrange(desc(count)) %>%  # arrange in descending order of number of landings
  head(10) %>% # take top 10 observations
  arrange(year) # rearrange by year

topAvgMass <- by_year %>% 
  arrange(desc(avgMass)) %>%  # arrange in descending order of average mass
  head(10) %>% # take top 10 observations
  arrange(year) # rearrange by year

Highcharter: Create a plot of the top 10 years with the highest number meteorite landings comparing average mass and number of landings

cols <- c("red","black")
myTitle <- "Top 10 Years with the Highest Number of Meteorite Landings and their Average Mass"
highchart() %>%
  hc_yAxis_multiples(
    list(title = list(text = "Average Mass (grams)")),
    list(title = list(text = "Number of Meteorite Landings"), opposite = TRUE)
  ) %>%
  hc_add_series(data = topCount$avgMass,
                name = "Average Mass (grams)",
                type = "column",
                yAxis = 0) %>%
  hc_add_series(data = topCount$count,
                name = "Number of Meteorite Landings",
                type = "line",
                yAxis = 1) %>%
  hc_xAxis(categories = topCount$year,
           tickInterval = 1) %>%
  hc_colors(cols) %>%
  hc_title(text = myTitle,
           margin = 15,
           align = "center")

Highcharter: Create a plot of the top 10 years with the highest average mass of meteorites comparing average mass and number of landings

cols <- c("red","black")
myTitle <- "Top 10 Years with the Highest Average Mass and their Number of Meteorite Landings"
highchart() %>%
  hc_yAxis_multiples(
    list(title = list(text = "Average Mass (grams)")),
    list(title = list(text = "Number of Meteorite Landings"), opposite = TRUE)
  ) %>%
  hc_add_series(data = topAvgMass$avgMass,
                name = "Average Mass (grams)",
                type = "column",
                yAxis = 0) %>%
  hc_add_series(data = topAvgMass$count,
                name = "Number of Meteorite Landings",
                type = "line",
                yAxis = 1) %>%
  hc_xAxis(categories = topAvgMass$year,
           tickInterval = 1) %>%
  hc_colors(cols) %>%
  hc_title(text = myTitle,
           margin = 15,
           align = "center")

Conclusion

My first visualization, comparing a categorical variable, represents the GeoLocation of all the known meteorites around the world and it compares those that have been seen while falling versus those that have been found years after they landed. It surprised me that there has been a lot of meteorite landings in North America and in Australia, a majority of them being categorized as “found”. In contrast, I was also surprised to see such high activity of meteorites that had landed and been seen in areas such as India and Europe. While this visualization classifies meteorite activity around the world, I wish I had been able to compare the number of landings throughout the world to determine if any region is more prone to meteorite landings than others. My final two visualizations, comparing quantitative variables, represent the top twenty years with the highest meteorite activity. I was very surprised to see that most of the years with the highest number of landings fell within more recent years (late 20th and early 21st century). In contrast, the top 10 years with the highest average meteorite masses occurred before the mid-twentieth century and as early as the 16th century. These findings make me wonder if the changing environmental favors (i.e. climate change, pollution, etc.) over the decades contribute to meteorite activity. While I would have loved to explore more of this dataset, I truly enjoyed diving into this data and creating intricate visualizations.