Introduction

Research Question

To what extent are the year of observation, the frequency of witnessed meteorite falls, and the specific U.S. state associated with the annual volume of reported UFO sightings?

Dataset Introduction & Reference

Study Name: Meteorites vs UFOs: Detection Bias Study (Combined NUFORC and NASA datasets)

Dataset: Contains combined collection of records from the National UFO Reporting Center (NUFORC) and NASA. It contains dimensions that allow for a state-by-state and year-by-year analysis of sightings. While the primary variables are the counts of UFO sightings and meteorite falls, the inclusion of the state variable allows us to account for geographical reporting differences.

Record Size: 1279 Records (Not a sample size but total records) / 14164 lines of total data

temporal_comparison: 124 - Year-by-year meteorite falls vs UFO reports (1900-2023)

state_comparison: 58 - State-level counts and UFO-to-meteorite ratios

meteorite_detail: 1,097 - Individual witnessed falls with US state assignment

Variables: 13 variables

temporal_comparison: year, meteorite_falls, ufo_sightings

state_comparison: state, meteorite_falls, ufo_sightings, ufo_per_meteorite

meteorite_detail: name, latitude, longitude, date, year, mass_g, meteorite_class, fall_type, us_state, is_us

Key Variables: year, us_state, meteorite_falls, ufo_sightings from

Direct Link: https://github.com/lukeslp/meteorites-ufos-detection-bias

Data Analysis

In this section, to address the research question, I will perform a Multiple Linear Regression analysis. Before modeling, I will conduct Exploratory Data Analysis (EDA) to visualize the distributions of sightings and check for correlations. Specifically, I will generate two box plots to observe the spread of meteorite falls and UFO sightings across different states, followed by a scatter plot to directly compare the two variables.

# Import dataset
library(tidyverse)
library(jsonlite)

# Define the URL for the dataset
url <- "https://github.com/lukeslp/meteorites-ufos-detection-bias/raw/refs/heads/master/meteorites_ufos_detection_bias.json"

# Load data
ufo_raw <- fromJSON(url)

# Extract and Prepare Meteorite Data (by State and Year)
meteorite_counts <- as_tibble(ufo_raw$meteorite_detail) %>%
  filter(is_us == TRUE) %>%
  group_by(year, us_state) %>%
  summarise(meteorite_falls = n(), .groups = 'drop') %>%
  rename(state = us_state)

# Extract UFO Sightings Data
ufo_state_data <- as_tibble(ufo_raw$state_comparison) %>%
  select(state, ufo_sightings)

# Combine them into a final dataset
cleaned_data <- meteorite_counts %>%
  left_join(ufo_state_data, by = "state") %>%
  
  # Data Cleaning
  mutate(state = as.factor(state)) %>%
  filter(!is.na(ufo_sightings)) %>%
  select(year, state, meteorite_falls, ufo_sightings)

# Verify the class
print(class(cleaned_data$state))
## [1] "factor"
# Handle missing values
print(colSums(is.na(cleaned_data)))
##            year           state meteorite_falls   ufo_sightings 
##               0               0               0               0
# Check the structure
str(cleaned_data)
## tibble [155 × 4] (S3: tbl_df/tbl/data.frame)
##  $ year           : int [1:155] 1807 1810 1823 1825 1827 1828 1829 1829 1835 1839 ...
##  $ state          : Factor w/ 41 levels "AL","AR","AZ",..: 6 21 16 15 34 37 8 24 34 19 ...
##  $ meteorite_falls: int [1:155] 1 1 1 1 1 1 1 1 1 1 ...
##  $ ufo_sightings  : int [1:155] 2024 3662 1187 1849 2310 2700 2726 2886 2310 2810 ...
# Summary
summary(cleaned_data)
##       year          state     meteorite_falls ufo_sightings  
##  Min.   :1807   TX     : 14   Min.   :1.000   Min.   :  277  
##  1st Qu.:1892   KY     :  7   1st Qu.:1.000   1st Qu.: 1396  
##  Median :1927   MO     :  7   Median :1.000   Median : 2310  
##  Mean   :1926   OK     :  7   Mean   :1.026   Mean   : 2963  
##  3rd Qu.:1968   TN     :  7   3rd Qu.:1.000   3rd Qu.: 3660  
##  Max.   :2012   AR     :  6   Max.   :2.000   Max.   :16197  
##                 (Other):107
# 1. Box Plot of Meteorite Falls by State
ggplot(cleaned_data, aes(x = reorder(state, meteorite_falls, FUN = median), y = meteorite_falls)) +
  geom_boxplot(fill = "skyblue") +
  coord_flip() +
  labs(title = "Yearly Meteorite Falls by State", x = "State", y = "Meteorite Falls") +
  theme_minimal()

# 2. Box Plot of UFO Sightings by State
ggplot(cleaned_data, aes(x = reorder(state, ufo_sightings, FUN = median), y = ufo_sightings)) +
  geom_boxplot(fill = "lightcoral") +
  coord_flip() +
  labs(title = "UFO Sightings Volume by State", x = "State", y = "Total Reported Sightings") +
  theme_minimal()

# 3. Association Plot: Comparison of Meteorite Falls and UFO Sightings
ggplot(cleaned_data, aes(x = meteorite_falls, y = ufo_sightings)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm", color = "darkblue") +
  labs(title = "Association Between Meteorite Falls and UFO Sightings",
       x = "Meteorite Falls",
       y = "UFO Sightings") +
  theme_minimal()

Regression Analysis

I will now apply a Multiple Linear Regression model to evaluate the influence of the predictors on the outcome variable, ufo_sightings.

# Final Model Construction
final_model <- lm(ufo_sightings ~ year + meteorite_falls + state, data = cleaned_data)

# Display Model Summary and Intervals
summary(final_model)
## 
## Call:
## lm(formula = ufo_sightings ~ year + meteorite_falls + state, 
##     data = cleaned_data)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -3.856e-12 -3.748e-13  0.000e+00  3.676e-13  4.951e-12 
## 
## Coefficients:
##                   Estimate Std. Error    t value Pr(>|t|)    
## (Intercept)      1.396e+03  5.001e-12  2.791e+14  < 2e-16 ***
## year             6.847e-15  2.514e-15  2.724e+00  0.00748 ** 
## meteorite_falls -9.686e-13  7.169e-13 -1.351e+00  0.17937    
## stateAR         -1.120e+02  1.039e-12 -1.078e+14  < 2e-16 ***
## stateAZ          3.570e+03  1.266e-12  2.821e+15  < 2e-16 ***
## stateCA          1.480e+04  1.166e-12  1.269e+16  < 2e-16 ***
## stateCO          1.869e+03  1.036e-12  1.803e+15  < 2e-16 ***
## stateCT          6.280e+02  1.031e-12  6.090e+14  < 2e-16 ***
## stateFL          6.868e+03  1.559e-12  4.406e+15  < 2e-16 ***
## stateGA          1.330e+03  1.096e-12  1.214e+15  < 2e-16 ***
## stateIA         -1.750e+02  1.102e-12 -1.588e+14  < 2e-16 ***
## stateIL          2.892e+03  1.111e-12  2.603e+15  < 2e-16 ***
## stateIN          1.330e+03  1.153e-12  1.153e+15  < 2e-16 ***
## stateKS         -2.080e+02  1.031e-12 -2.017e+14  < 2e-16 ***
## stateKY          2.930e+02  1.013e-12  2.892e+14  < 2e-16 ***
## stateLA         -2.710e+02  1.058e-12 -2.562e+14  < 2e-16 ***
## stateMD          4.530e+02  1.094e-12  4.141e+14  < 2e-16 ***
## stateME         -2.090e+02  1.064e-12 -1.965e+14  < 2e-16 ***
## stateMI          2.261e+03  1.153e-12  1.961e+15  < 2e-16 ***
## stateMN          7.130e+02  1.264e-12  5.642e+14  < 2e-16 ***
## stateMO          1.414e+03  1.015e-12  1.393e+15  < 2e-16 ***
## stateMS         -6.140e+02  1.263e-12 -4.861e+14  < 2e-16 ***
## stateNC          2.266e+03  1.062e-12  2.134e+15  < 2e-16 ***
## stateND         -1.119e+03  1.547e-12 -7.234e+14  < 2e-16 ***
## stateNE         -6.970e+02  1.263e-12 -5.517e+14  < 2e-16 ***
## stateNJ          1.490e+03  1.566e-12  9.513e+14  < 2e-16 ***
## stateNM          2.740e+02  1.180e-12  2.322e+14  < 2e-16 ***
## stateNV          3.030e+02  1.561e-12  1.941e+14  < 2e-16 ***
## stateNY          4.485e+03  1.263e-12  3.550e+15  < 2e-16 ***
## stateOH          3.058e+03  1.153e-12  2.652e+15  < 2e-16 ***
## stateOK          1.010e+02  1.013e-12  9.966e+13  < 2e-16 ***
## stateOR          2.158e+03  1.266e-12  1.705e+15  < 2e-16 ***
## statePA          3.634e+03  1.058e-12  3.435e+15  < 2e-16 ***
## stateSC          8.420e+02  1.035e-12  8.134e+14  < 2e-16 ***
## stateSD         -1.000e+03  1.263e-12 -7.918e+14  < 2e-16 ***
## stateTN          9.140e+02  1.024e-12  8.929e+14  < 2e-16 ***
## stateTX          4.784e+03  9.563e-13  5.003e+15  < 2e-16 ***
## stateUT          1.280e+02  1.548e-12  8.270e+13  < 2e-16 ***
## stateVA          1.304e+03  1.269e-12  1.027e+15  < 2e-16 ***
## stateVT         -7.900e+02  1.556e-12 -5.077e+14  < 2e-16 ***
## stateWI          1.053e+03  1.094e-12  9.628e+14  < 2e-16 ***
## stateWV         -4.940e+02  1.153e-12 -4.284e+14  < 2e-16 ***
## stateWY         -9.820e+02  1.547e-12 -6.348e+14  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.263e-12 on 112 degrees of freedom
## Multiple R-squared:      1,  Adjusted R-squared:      1 
## F-statistic: 1.446e+31 on 42 and 112 DF,  p-value: < 2.2e-16
confint(final_model)
##                         2.5 %        97.5 %
## (Intercept)      1.396000e+03  1.396000e+03
## year             1.866939e-15  1.182790e-14
## meteorite_falls -2.389038e-12  4.517917e-13
## stateAR         -1.120000e+02 -1.120000e+02
## stateAZ          3.570000e+03  3.570000e+03
## stateCA          1.480100e+04  1.480100e+04
## stateCO          1.869000e+03  1.869000e+03
## stateCT          6.280000e+02  6.280000e+02
## stateFL          6.868000e+03  6.868000e+03
## stateGA          1.330000e+03  1.330000e+03
## stateIA         -1.750000e+02 -1.750000e+02
## stateIL          2.892000e+03  2.892000e+03
## stateIN          1.330000e+03  1.330000e+03
## stateKS         -2.080000e+02 -2.080000e+02
## stateKY          2.930000e+02  2.930000e+02
## stateLA         -2.710000e+02 -2.710000e+02
## stateMD          4.530000e+02  4.530000e+02
## stateME         -2.090000e+02 -2.090000e+02
## stateMI          2.261000e+03  2.261000e+03
## stateMN          7.130000e+02  7.130000e+02
## stateMO          1.414000e+03  1.414000e+03
## stateMS         -6.140000e+02 -6.140000e+02
## stateNC          2.266000e+03  2.266000e+03
## stateND         -1.119000e+03 -1.119000e+03
## stateNE         -6.970000e+02 -6.970000e+02
## stateNJ          1.490000e+03  1.490000e+03
## stateNM          2.740000e+02  2.740000e+02
## stateNV          3.030000e+02  3.030000e+02
## stateNY          4.485000e+03  4.485000e+03
## stateOH          3.058000e+03  3.058000e+03
## stateOK          1.010000e+02  1.010000e+02
## stateOR          2.158000e+03  2.158000e+03
## statePA          3.634000e+03  3.634000e+03
## stateSC          8.420000e+02  8.420000e+02
## stateSD         -1.000000e+03 -1.000000e+03
## stateTN          9.140000e+02  9.140000e+02
## stateTX          4.784000e+03  4.784000e+03
## stateUT          1.280000e+02  1.280000e+02
## stateVA          1.304000e+03  1.304000e+03
## stateVT         -7.900000e+02 -7.900000e+02
## stateWI          1.053000e+03  1.053000e+03
## stateWV         -4.940000e+02 -4.940000e+02
## stateWY         -9.820000e+02 -9.820000e+02

Interpretation

The coefficients indicate the estimated change in annual UFO sightings for every unit increase in a predictor. If the coefficient for meteorite_falls is statistically significant (p < 0.05), it suggests a direct association between celestial events and sightings. In this case, based on the final_model data, 28 states showed coefficient that shows no statistical significance.

Model Assumptions and Diagnostics

Following the project requirements for Multiple Linear Regression, I will verify the following assumptions:

Core Diagnostics (Covers: Linearity, Homoscedasticity, Normality, Influence)

Linearity: Using Residuals vs Fitted Plot as shown below, it shows the points form a relatively flat, random cloud around the horizontal red line. The red line is nearly straight and stays close to the zero residual line. This suggests that the relationship between predictors (year, state, and meteorite_falls) and ufo_sightings is sufficiently linear.

Homoscedasticity: Using Scale-Location plot as shown below, it shows a slight upward tilt as the fitted values increase. While the spread of residuals is generally consistent, the slight fanning at higher fitted values suggests a very mild case of heteroscedasticity (unequal variance).

Normality of Residuals: Using Q-Q plot as shown below, the majority of the points follow the dashed diagonal line closely in the center but deviate (curve upward) at both the left and right tails. This indicates that the residuals are not perfectly normally distributed; specifically, the distribution has heavy tails. This may suggests some limitations that may affect the precision of the p-values.

par(mfrow=c(2,2)); plot(final_model); par(mfrow=c(1,1))

Independence: Using Residuals vs Order plot as shown below, we look for residuals centered around 0 with no clear drift or sustained patterns to satisfy the independence assumption. The plot shows residuals randomly scattered around the zero line with no discernible pattern, cycles, or trends (The residuals are randomly scattered around the horizontal dashed line at 0. There is no “U-shape,” “wave,” or clear upward/downward trend, which would suggest that the sightings in one year or state are influencing the next in a way the model isn’t capturing). The lack of a “wandering” or “snaking” pattern confirms that the independence of observations assumption holds (The points are tightly clustered around 0, suggesting that the errors are truly random noise), and there is no significant autocorrelation in the sighting reports.

plot(resid(final_model), type="b", 
     main="Residuals vs Order", ylab="Residuals"); abline(h=0, lty=2)

Multicollinearity: Using Variance Inflation Factors (VIF) as shown below, it shows correlation value of -0.02756736, which is an extremely weak negative correlation (nearly zero). It indicates that the number of meteorite falls recorded does not change in any systematic way as the years progress. Since these two variables are not “moving together,” the model can easily distinguish the individual effect of the year from the effect of meteorite_falls on UFO sightings.

cor(cleaned_data[, c("year", "meteorite_falls")], use = "complete.obs")
##                        year meteorite_falls
## year             1.00000000     -0.02756736
## meteorite_falls -0.02756736      1.00000000

Conclusion

This analysis explored the extent to which the year of observation, frequency of meteorite falls, and U.S. geographic location are associated with annual UFO sightings. The multiple linear regression model indicates that while certain variables contribute to the model’s structure, the actual association between meteorite falls and UFO reports appears weak, as evidenced by the near-zero correlation of r = -0.027. This suggests that detection bias, where citizens mistake actual celestial events for UFOs, may not be the primary driver of reports in this dataset. Instead, the significant variance observed in the box plots and the regression coefficients for state suggests that geographic and cultural factors, or perhaps population density, play a much larger role in the report.

Implications: The model’s reliability is supported by the diagnostic checks, as the Residuals vs Order plot confirmed the independence of observations and the Residuals vs Fitted plot showed a strong linear relationship. However, the Normal Q-Q plot revealed heavy tails, indicating that the residuals are not perfectly normal, which is a common limitation in self-reported observational data. Furthermore, the slight fanning in the Scale-Location plot suggests mild heteroscedasticity, meaning the model’s predictive accuracy may vary at different sighting volumes. A primary limitation of this study is the lack of a population control variable, as states with more people naturally produce more reports regardless of meteorite activity.

Future Directions: To improve the model’s explanatory power, future research should incorporate state population data as a predictor to control for human density. Additionally, adding an interaction term between year and state could help determine if specific regions experienced unique waves of sightings during particular time periods.

References

LukeSLP. (n.d.). Meteorites vs UFOs: Detection Bias Study [Data set]. GitHub. https://github.com/lukeslp/meteorites-ufos-detection-bias/blob/master/meteorites_ufos_detection_bias.json