For this project, we will explore the Meteorite Landings Dataset, which I accessed from NASA’s Open Data Portal. The data comes from The Meteoritical Society, containing information on all of the known meteorite landings. Over 45,000 observations were recorded with 10 variables recorded from 860 until 2013, but unfortunately, I was unable to find a ReadMe file describing how the information was collected. The following variables were recorded:
name - name of meteorite (typically a location, often modified with a number, year, composition, etc.)
id - a unique identifier for the meteorite
nametype - ‘valid’: a typical meteorite OR ‘relict’: a meteorite has been highly altered by weathering on Earth
recclass - recommended classification of meteorite (based on physical, chemical, etc.)
mass(g) - mass in grams of meteorite
fall - ‘fell’: witnessed during their fall OR ‘found’: discovered many years later
year - year meteorite fell or was found
reclat - latitude of the meteorite landing
reclong - longitude of the meteorite landing
GeoLocation - coordinate of meteorite landing (parentheses encloses combining reclat and reclong)
Of the four categorical variables, nametype, recclass, fall and year, we will explore fall and year. The reason for this is that the majority of meteorites are valid and there are over 400 recommended classifications. We will also explore the following quantitative variables: mass(g), reclat and reclong. We will clean the dataset by making all variable names lowercase, replacing spaces with underscores, and removing observations with NAs. The two questions we will ponder throughout are, “is there a relationship between a meteorite’s mass and it’s GeoLocation?” and “is there a relationship between the number of meteorite landings and the average mass of meteorites each year?”. We will also explore the years with the highest meteorite activity. To explore the relationship between mass and GeoLocation, we will determine the correlation coefficient and create a multiple regression model. To determine the relationship between number of landings in a year and the average mass, we will group the data by year, count the number of meteorites each year and then calculate the average mass of meteorites each year. This information will then be explored through a variety of models and visualizations.
I chose this topic because I am fascinated with all thing space-related and because my ultimate dream is to work for NASA. I dream to eventually be given the chance to travel to space, or at least work supporting the safe journey for a group of astronauts.
https://data.nasa.gov/Space-Science/Meteorite-Landings/gh4g-9sfh
Tripathi, Kamlakant & Navale, Pratik. (2018). Meteorite Landings. 10.13140/RG.2.2.27368.57602.
Working libraries
#install.packages('tigris')
myPackages <- c("ggplot2", "tidyverse", "tmap", "tmaptools", "leaflet", "leaflet.extras", "dplyr", "rio", "sp", "RColorBrewer", "raster", "rgdal", "sf", "tigris", "highcharter", "plotly")
invisible(lapply(myPackages, library, character.only = TRUE))
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ tibble 3.1.8 ✔ dplyr 1.0.10
## ✔ tidyr 1.2.1 ✔ stringr 1.5.0
## ✔ readr 2.1.3 ✔ forcats 0.5.2
## ✔ purrr 0.3.5
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
##
## Attaching package: 'raster'
##
##
## The following object is masked from 'package:dplyr':
##
## select
##
##
## Please note that rgdal will be retired during 2023,
## plan transition to sf/stars/terra functions using GDAL and PROJ
## at your earliest convenience.
## See https://r-spatial.org/r/2022/04/12/evolution.html and https://github.com/r-spatial/evolution
## rgdal: version: 1.6-2, (SVN revision 1183)
## Geospatial Data Abstraction Library extensions to R successfully loaded
## Loaded GDAL runtime: GDAL 3.4.2, released 2022/03/08
## Path to GDAL shared files: /Library/Frameworks/R.framework/Versions/4.2/Resources/library/rgdal/gdal
## GDAL binary built with GEOS: FALSE
## Loaded PROJ runtime: Rel. 8.2.1, January 1st, 2022, [PJ_VERSION: 821]
## Path to PROJ shared files: /Library/Frameworks/R.framework/Versions/4.2/Resources/library/rgdal/proj
## PROJ CDN enabled: FALSE
## Linking to sp version:1.5-1
## To mute warnings of possible GDAL/OSR exportToProj4() degradation,
## use options("rgdal_show_exportToProj4_warnings"="none") before loading sp or rgdal.
##
## Linking to GEOS 3.10.2, GDAL 3.4.2, PROJ 8.2.1; sf_use_s2() is TRUE
##
## To enable caching of data, set `options(tigris_use_cache = TRUE)`
## in your R script or .Rprofile.
##
## Registered S3 method overwritten by 'quantmod':
## method from
## as.zoo.data.frame zoo
##
## Highcharts (www.highcharts.com) is a Highsoft software product which is
##
## not free for commercial and Governmental use
##
##
## Attaching package: 'plotly'
##
##
## The following object is masked from 'package:raster':
##
## select
##
##
## The following object is masked from 'package:rio':
##
## export
##
##
## The following object is masked from 'package:ggplot2':
##
## last_plot
##
##
## The following object is masked from 'package:stats':
##
## filter
##
##
## The following object is masked from 'package:graphics':
##
## layout
Read in the Meteorite Landings Dataset
setwd("/Users/KathyOchoa/Documents/DATA 110/Final Project")
meteoriteData <- read_csv("Meteorite_Landings.csv")
## Rows: 45716 Columns: 12
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (5): name, nametype, recclass, fall, GeoLocation
## dbl (7): id, mass (g), year, reclat, reclong, States, Counties
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(meteoriteData)
## # A tibble: 6 × 12
## name id namet…¹ reccl…² mass …³ fall year reclat reclong GeoLo…⁴ States
## <chr> <dbl> <chr> <chr> <dbl> <chr> <dbl> <dbl> <dbl> <chr> <dbl>
## 1 Aachen 1 Valid L5 21 Fell 1880 50.8 6.08 (50.77… NA
## 2 Aarhus 2 Valid H6 720 Fell 1951 56.2 10.2 (56.18… NA
## 3 Abee 6 Valid EH4 107000 Fell 1952 54.2 -113 (54.21… NA
## 4 Acapu… 10 Valid Acapul… 1914 Fell 1976 16.9 -99.9 (16.88… NA
## 5 Achir… 370 Valid L6 780 Fell 1902 -33.2 -65.0 (-33.1… NA
## 6 Adhi … 379 Valid EH4 4239 Fell 1919 32.1 71.8 (32.1,… NA
## # … with 1 more variable: Counties <dbl>, and abbreviated variable names
## # ¹nametype, ²recclass, ³`mass (g)`, ⁴GeoLocation
Clean the data by making all variables lowercase and by replacing spaces with underscores.
names(meteoriteData) <- tolower(names(meteoriteData)) # make names lowercase
names(meteoriteData) <- gsub(" ","_",names(meteoriteData)) # replace spaces with underscores
Remove observations with NA values from reclat, reclong, year, and mass_(g) columns.
removeNAs <- meteoriteData %>%
filter(!is.na(reclat) & !is.na(reclong) & !is.na(year) & !is.na(`mass_(g)`))
Get geographical data files to visualize the world and do a quick plot of the world map
setwd("/Users/KathyOchoa/Documents/DATA 110/world-administrative-boundaries")
worldshapefile <- "world-administrative-boundaries.shp"
worldgeo <- shapefile(worldshapefile)
qtm(worldgeo) # quick thematic plot for reference
Create a scatter plot of the fallen versus found meteorite landings around the world using latitude and longitude
plot1 <- removeNAs %>%
ggplot(aes(x=reclong, y=reclat)) + # x-axis: longitude and y-axis: latitude
geom_point(alpha=0.5, aes(color = fall)) + #scatter plot differentiating fallen vs found meteorites
scale_color_manual(name = "Fall/Found", values = c("Fell" = 'black',
"Found" = 'red')) +
ggtitle("Fallen vs Found Meteorite Landings Around the World") +
xlab("Longitude") +
ylab("Latitude") +
theme_minimal()
plot1
Determine if there is a relationship between mass of meteorite and location.
cor(removeNAs$`mass_(g)`, removeNAs$reclat) #correlation of mass with latitude
## [1] 0.02923452
cor(removeNAs$`mass_(g)`, removeNAs$reclong) #correlation of mass with longitude
## [1] -0.02185385
Since both correlation coefficient values are close to zero, there is no correlation between the mass of a meteorite and that meteorite’s GeoLocation.
fit1 <- lm(`mass_(g)` ~ reclat + reclong + year + fall + nametype, data = removeNAs)
summary(fit1)
##
## Call:
## lm(formula = `mass_(g)` ~ reclat + reclong + year + fall + nametype,
## data = removeNAs)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4212153 -29166 -9334 27507 59692103
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7572095.02 312010.13 24.269 < 2e-16 ***
## reclat 519.20 89.07 5.829 5.62e-09 ***
## reclong 63.75 49.42 1.290 0.197
## year -3943.29 145.78 -27.049 < 2e-16 ***
## fallFound 310029.21 23246.20 13.337 < 2e-16 ***
## nametypeValid 5912.70 136028.88 0.043 0.965
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 622500 on 38109 degrees of freedom
## Multiple R-squared: 0.01971, Adjusted R-squared: 0.01958
## F-statistic: 153.3 on 5 and 38109 DF, p-value: < 2.2e-16
Since reclong and nametypeValid do not have asterisks to the right, they do not contribute to the model. Let’s remove these variables.
fit2 <- lm(`mass_(g)` ~ reclat + year + fall, data = removeNAs)
summary(fit2)
##
## Call:
## lm(formula = `mass_(g)` ~ reclat + year + fall, data = removeNAs)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4187257 -26939 -7573 28718 59691979
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7549391.61 278499.89 27.107 < 2e-16 ***
## reclat 450.66 71.46 6.306 2.89e-10 ***
## year -3926.58 145.19 -27.044 < 2e-16 ***
## fallFound 306493.71 23083.71 13.277 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 622500 on 38111 degrees of freedom
## Multiple R-squared: 0.01967, Adjusted R-squared: 0.01959
## F-statistic: 254.9 on 3 and 38111 DF, p-value: < 2.2e-16
Since the adjusted R-squared value did not improve, this is not a good model.
Create by_year, where you take the removeNAs data and then group by year, and then summarize to count the number of landings and the average mass of meteorite landings.
by_year <- removeNAs %>% #use dataset without NAs
group_by(year) %>% # combine observations for each year
summarize(count = n(), # count the number of landings
avgMass = mean(`mass_(g)`)) # calculate the average mass of meteorites each year
countVsAvgMass <-ggplot(by_year, aes(x=count, y = avgMass)) +
ggtitle("Number of Meteorite Landings vs Average Mass") +
xlab("Number of Landings") +
ylab ("Average Mass") +
theme_minimal(base_size = 12) +
geom_point() +
geom_smooth(method = 'lm', formula = y~x, se = FALSE, linetype= "solid", size = 0.5, color = "red") # linear regression line without confidence interval
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
countVsAvgMass
Determine if there is any correlation between count (the number of landings each year) and avgMass (the average mass, in grams, of meteorites each year)
cor(by_year$count, by_year$avgMass) # determine the correlation coefficient
## [1] -0.04858525
Since the correlation coefficient is close to zero, there is no correlation between the number of meteorite landings and the average mass each year.
fit3 <- lm(count ~ avgMass, data = by_year)
summary(fit3)
##
## Call:
## lm(formula = count ~ avgMass, data = by_year)
##
## Residuals:
## Min 1Q Median 3Q Max
## -152.5 -149.5 -141.2 -126.3 2891.5
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.535e+02 2.676e+01 5.738 2.76e-08 ***
## avgMass -6.155e-06 7.987e-06 -0.771 0.442
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 421.5 on 251 degrees of freedom
## Multiple R-squared: 0.002361, Adjusted R-squared: -0.001614
## F-statistic: 0.5939 on 1 and 251 DF, p-value: 0.4416
Since the the p-value for avgMass is relatively high, it is likely not contributing to the model. These results are confirmed by the low adjusted R-squared value, indicating that this is not a good model.
Let’s take a look at the 10 years with the highest number of meteorite landings and 10 years with the highest average mass
topCount <- by_year %>%
arrange(desc(count)) %>% # arrange in descending order of number of landings
head(10) %>% # take top 10 observations
arrange(year) # rearrange by year
topAvgMass <- by_year %>%
arrange(desc(avgMass)) %>% # arrange in descending order of average mass
head(10) %>% # take top 10 observations
arrange(year) # rearrange by year
cols <- c("red","black")
myTitle <- "Top 10 Years with the Highest Number of Meteorite Landings and their Average Mass"
highchart() %>%
hc_yAxis_multiples(
list(title = list(text = "Average Mass (grams)")),
list(title = list(text = "Number of Meteorite Landings"), opposite = TRUE)
) %>%
hc_add_series(data = topCount$avgMass,
name = "Average Mass (grams)",
type = "column",
yAxis = 0) %>%
hc_add_series(data = topCount$count,
name = "Number of Meteorite Landings",
type = "line",
yAxis = 1) %>%
hc_xAxis(categories = topCount$year,
tickInterval = 1) %>%
hc_colors(cols) %>%
hc_title(text = myTitle,
margin = 15,
align = "center")
cols <- c("red","black")
myTitle <- "Top 10 Years with the Highest Average Mass and their Number of Meteorite Landings"
highchart() %>%
hc_yAxis_multiples(
list(title = list(text = "Average Mass (grams)")),
list(title = list(text = "Number of Meteorite Landings"), opposite = TRUE)
) %>%
hc_add_series(data = topAvgMass$avgMass,
name = "Average Mass (grams)",
type = "column",
yAxis = 0) %>%
hc_add_series(data = topAvgMass$count,
name = "Number of Meteorite Landings",
type = "line",
yAxis = 1) %>%
hc_xAxis(categories = topAvgMass$year,
tickInterval = 1) %>%
hc_colors(cols) %>%
hc_title(text = myTitle,
margin = 15,
align = "center")
My first visualization, comparing a categorical variable, represents the GeoLocation of all the known meteorites around the world and it compares those that have been seen while falling versus those that have been found years after they landed. It surprised me that there has been a lot of meteorite landings in North America and in Australia, a majority of them being categorized as “found”. In contrast, I was also surprised to see such high activity of meteorites that had landed and been seen in areas such as India and Europe. While this visualization classifies meteorite activity around the world, I wish I had been able to compare the number of landings throughout the world to determine if any region is more prone to meteorite landings than others. My final two visualizations, comparing quantitative variables, represent the top twenty years with the highest meteorite activity. I was very surprised to see that most of the years with the highest number of landings fell within more recent years (late 20th and early 21st century). In contrast, the top 10 years with the highest average meteorite masses occurred before the mid-twentieth century and as early as the 16th century. These findings make me wonder if the changing environmental favors (i.e. climate change, pollution, etc.) over the decades contribute to meteorite activity. While I would have loved to explore more of this dataset, I truly enjoyed diving into this data and creating intricate visualizations.