1 Introduction
The U.S. government has only recently began (or rather, restarted) a scientific investigation into unidentified aerial phenomena (UAP, formerly UFOs). However, there is a long tradition of attempting to study these sightings as scientifically as possible. These efforts range from official government programs, like the U.S. Air Force’s Project Blue Book, to civilian organizations like the Mutual UFO Network (MUFON) or the National UFO Reporting Center (NUFORC), to privately funded programs like the National Institute for Discovery Science (NIDS).
This work has attracted the attentions of numerous highly trained experts, most often engineers, physicists, and psychologists. Statisticians, however, are rather less visible (although see Laurent, et al., 2015, and Krishnamurthy, et al., 2017). This seems odd, as several large databases of UAP sightings are maintained by numerous groups, and at considerable cost. Applying statistical methods to these data sets is likely a fruitful investment.
This paper assembles a basic statistical model that exploits a well-known feature of UAP geography. Plotting sightings on a map is perhaps the most obvious initial investigation. However, it is also the most futile. Researchers quickly discover that they have mapped little more than population density. Compare the map of UAP sightings (left) with the Census population density map (right) below—the largest cities have the largest clusters, medium cities have medium clusters, and rural areas have zero or one or two sightings. This foils any attempt to learn something non-trivial, i.e., locating particular ‘hot zones’ or clusters that may suggest an explanation.
Right: Plot of UAP sightings from 2000–2020 (NUFORC). Left: Plot of population density (U.S. Census). The two maps display the variable: population.
This paper attempts to get around this problem by literally removing population from the equation. Potential hot spots come into focus, suggesting novel hypotheses to be tested in the future, as well as locations best suited to hosting stationary electronic sensors. A city-level data set is collected, including the number of UAP sightings recorded in the NUFORC database, population, population density, and land area. A linear model is fit using least-squares regression. The model meets the assumptions required by linear regression, and appears to capture the data quite well. Large model errors are interpreted as indicating the presence of some effect above-and-beyond that of population. These can be considered UAP ‘hot spots,’ potential sites for further study.
The R code for this project is stored in the Github repository benhorvath/nuforc_stats. Contact the author at benhorvath@gmail.com to request the compiled data.
2 Observational capacity
The simplest model of a territory’s UAP observations abstracts from the particulars of time and geography. It assumes that every person in a given territory has an equal chance of observing UAP, every portion of land within the territory has an equal chance of ‘manifesting’ UAP, and that these probabilities do not change over time. It additionally assumes that rates of reporting do not vary across populations. Although these assumptions are probably wrong, they can be provisionally accepted for the purposes of this initial work.
Comparing the maps above, it is clear that population is a major factor. All else being equal, a territory with a larger population has more people observing the sky. That is, larger cities have more observational capacity, a greater probability of observing and reporting aerial phenomena.
However, the land area of a territory also must be taken into account. At some point, the marginal increase of additional population drops to zero. The additional population will be ‘covering’ a part of the sky that is already well-observed, probably by multiple people simultaneously. There is no benefit to the extra population. This ‘saturation point’ depends on the land area of the territory. The larger the territory, the more population it can be accomodated before saturation. Ultimately, the more area of sky that is ‘covered,’ the more UAP will be reported.
This suggests a basic mathematical representation of a territory \(i\)’s number of UAP sightings \(n\):
\[n_i = f(\mathrm{population}_i, ~ \mathrm{area}_i) + \epsilon\] where \(\epsilon\) is an error term, representing the effect of all other variables besides population and area.
The most basic form this function could take is linear and additive. These models generally represent the random variable of interest as \(y\), modeled as a combination of independent variables \(x_1, x_2, \ldots, x_p\) plus an error term:
\[y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p + \epsilon\]
where \(\beta_j\) represents a scalar controlling the relationship between an independent variable \(x_j\) and \(y\). Such a model is easily fit from empirical data via the least squares procedure, available in the statistical software R (R Core Team 2020). Given certain assumptions are met, statistical inferences about the fit parameters and error terms can be derived as well.
3 Data
This project uses three main data sources. First, all UAP sightings in the continental U.S. between 2000 and 2020 were scraped from the publicly available NUFORC Database (NUFORC, Renner 2017). This totals 72,298 unique observations. The standard database entry provides an account by the witness, the date and location of the sighting, the object’s shape, and the duration of sighting. The data set required some cleaning. Most notably, all boroughs of New York City, NY were reassigned to ‘New York City,’ and any records marked by NUFORC staff as obvious satellites have been removed (MADAR, etc.).
The U.S. Census Bureau maintains a record of the land and water area of almost all American cities (U.S. Census Bureau 2020). Although the the sightings occur over a twenty year period, areas are used as recorded in the 2020 Gazatteer files. Presumably, most cities’ area has not shifted substantially.
Finally, total population and population density were retrieved from the Census’s 2019 estimates. This data was pulled effortlessly via the R package tidycensus (Walker, et al., 2021). Again, for simplicity, it is assumed that city population and population density have not varied dramatically for most cities over the last two decades.
The combined data set contains nearly 58 thousand unique observations across more than 7700 cities, towns, and villages. This data covers 180 million Americans across all continental states, more than half of the country’s total population.
The plots below show the strong correlations between most of the variables under study. The linearity of each relationship strengthens considerably on a log scale. Note that the only variables not strongly correlated with each other are land area and population density. This becomes important in the next section.
Graphical summaries of the variables under study. Left shows the raw variables; right shows the natural log-transformed variables. The large decimals refer to the Pearson correlation between the variables, with asterisks marking statistical significance.
4 Model
A linear regression model is fit on the natural logs of land area (square miles), population density (persons per square mile), and the interaction between the two. These two variables were chosen among those available because they were not highly correlated, as required by the assumptions of regression.
As always, the result do not indicate a causal relationship, only a statistical association. The full software output is below:
##
## Call:
## lm(formula = log(n) ~ log(density) * log(aland_sqmi), data = dff)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.7188 -0.5013 -0.0192 0.4891 3.9866
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.188593 0.104692 -11.35 <0.0000000000000002
## log(density) 0.227042 0.014589 15.56 <0.0000000000000002
## log(aland_sqmi) -0.507741 0.050374 -10.08 <0.0000000000000002
## log(density):log(aland_sqmi) 0.143005 0.006952 20.57 <0.0000000000000002
##
## (Intercept) ***
## log(density) ***
## log(aland_sqmi) ***
## log(density):log(aland_sqmi) ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7347 on 7700 degrees of freedom
## Multiple R-squared: 0.568, Adjusted R-squared: 0.5678
## F-statistic: 3374 on 3 and 7700 DF, p-value: < 0.00000000000000022
All variables, including the interaction term, are highly significant. A comprehensive analysis of the model’s residuals indicates that the model meets the assumptions of linear regression (but see the appendix for more). Thus, a city’s \(i\)’s number of UAP sightings \(n\) over the period 2000–2020 (on average) can be estimated by the equation:
\[n_i = -1.1885 + 0.2270~log(density_i) - 0.5077~log(area_i) + 0.1430~log(density_i)~log(area_i)\]
Interpreting the fit parameters is complicated by the model’s structure. The interaction term necessarily implies that the association between density and the number of UAP observations varies according to the size of the land area, and vice versa. A graphical display communicates the model more succinctly:
Graphical display of the fit model.
The plot demonstrates how both land area and population density interact to ‘produce’ UAP observations. Each colored line represents a different level of population density. For example, a 40 square mile territory with a density of 500 people/square mile (red line) can expect 5 observations over the relevant time span. However, a territory with only 10 square miles (yellow line) can achieve the same number of observations with a population density of 1200 persons/square mile. That is, more people covering less area increases the expected number of UAP—up to the saturation point.
5 Results
This modeling exercise allows us to predict how many observations a city should have, given its size and population, on average. If a city has many more than the predicted value (large error), this suggests that something else is contributing to the quantity of sightings, over-and-above the effect of its observational capacity.
The top 25 potential hot spots suggested by the model are below. The column n refers to the actual number of UAP that were observed in that city from 2000–2020, and the column pred indicates how many were predicted by the model. The column times is n divided by pred, i.e., the relative size of the error.
| state | city | pop | aland_sqmi | density | n | pred | times |
|---|---|---|---|---|---|---|---|
| TX | Mesquite | 140937 | 0.05 | 2982.68 | 15 | 0.28 | 53.87 |
| SC | Pawleys Island | 108 | 0.70 | 153.99 | 38 | 0.89 | 42.87 |
| AZ | Cottonwood | 12253 | 0.14 | 733.66 | 22 | 0.58 | 37.68 |
| AZ | Gila Bend | 2100 | 64.37 | 32.63 | 23 | 0.65 | 35.58 |
| SC | Myrtle Beach | 34695 | 23.50 | 1472.54 | 245 | 8.65 | 28.31 |
| SC | North Myrtle Beach | 16819 | 21.16 | 785.74 | 125 | 5.39 | 23.18 |
| WA | Harrington | 418 | 0.39 | 1071.31 | 15 | 0.94 | 16.02 |
| NC | Ocean Isle Beach | 665 | 3.74 | 179.08 | 21 | 1.35 | 15.58 |
| GA | Cumming | 6547 | 7.19 | 978.83 | 55 | 3.73 | 14.75 |
| MD | Bel Air | 10119 | 0.69 | 3332.16 | 22 | 1.50 | 14.63 |
| AZ | Sedona | 10339 | 18.26 | 543.23 | 58 | 3.98 | 14.56 |
| WV | Hedgesville | 297 | 0.13 | 2259.47 | 7 | 0.52 | 13.38 |
| FL | Naples | 22088 | 12.30 | 1796.48 | 90 | 6.88 | 13.09 |
| FL | Bunnell | 2943 | 139.78 | 21.25 | 5 | 0.43 | 11.63 |
| MD | Ocean City | 6944 | 4.53 | 1532.61 | 39 | 3.65 | 10.69 |
| NY | Liverpool | 2204 | 0.75 | 2911.62 | 16 | 1.55 | 10.29 |
| CA | Mountain View | 82739 | 0.29 | 6900.35 | 9 | 0.88 | 10.21 |
| PA | Langhorne | 1580 | 0.49 | 3211.54 | 12 | 1.20 | 9.97 |
| AZ | Buckeye | 79620 | 392.98 | 202.54 | 45 | 4.58 | 9.83 |
| WA | Davenport | 1744 | 1.67 | 1046.25 | 17 | 1.89 | 8.98 |
| NC | Oak Island | 8386 | 19.27 | 449.66 | 32 | 3.60 | 8.89 |
| FL | Orlando | 287442 | 110.62 | 2599.90 | 289 | 33.09 | 8.73 |
| FL | Vero Beach | 17503 | 11.45 | 1529.12 | 51 | 6.02 | 8.48 |
| NY | Lake George | 874 | 0.59 | 1491.37 | 10 | 1.20 | 8.32 |
| WA | Snohomish | 10154 | 3.52 | 2883.57 | 34 | 4.12 | 8.26 |
For example, based on its population and land area, the town of Mesquite, TX should have about zero observations. Instead it has 15, more than 53 times the model prediction! This suggests that something beyond population and area is ‘producing’ UAP. (Specifically, UAP observations, rather than UAP itself.)
This list is much more helpful to researchers than the raw data, which suggests that the top 25 hot spots are the 25 largest cities. It is interesting to note a large concentration around the Myrtle Beach area, which also includes North Myrtle Beach and Ocean Isle Beach. Large Florida cities with a substantial tourist industry are also well-represented, including Naples and Orlando. Oak Island, NC is another tourist destination. The exceptional quantity of observations concentrated in coastal tourist towns suggest several hypotheses to account for this. Among other, these include that: 1. People in these towns spend a disproportionate amount of time idling on a beach, and thus have more time to observe the sky; or 2. Many people are consuming alcohol and seeing things that are not there. The American southwest is also present with excess observations in Texas and Arizona towns.
However, it must be remembered this is a simple approach using a simple model. This exercise is designed as a first step, not a final step. There is no guarantee that a city is truly exceptional because it appears on this list. One potential problem is that the NUFORC entries do not include an observer ID. It is possible that one location’s excessive sightings may only be a single resident is a UFO obsessive. Additionally, reporting rates probably vary by geography. Some populations may prefer to report to their sightings to a local expert rather than NUFORC (as in the Utah’s Uintah Basin region; see the appendix).
The full list of 7700 cities and towns is available as a CSV on the author’s Github: https://github.com/benhorvath/nuforc_stats/blob/main/results/top_hotspots.csv.
6 Conclusion
Mapping UAP observations from large sighting databases is an obvious first move, but it is stymied by the overwhelming effect of population. This paper developed a simple linear regression model to account for this effect. By modeling a city’s number of sightings as a function of land area and population density, it is possible to produce a superior list of UAP ‘hot spots.’ This list can be used to generate future hypotheses to test (e.g., the relationship between tourism and sightings). It can also be used to suggest placements for electronic sensor platforms (e.g., Skyhub). Building on this model could bring us closer to a scientific explanation for the UAP mystery.
7 References
Krishnamurthy, Harish, Anna Lafontant, and Ren Yi. 2017. ‘A time-series cluster space search scheme for localization of geospatial events in the UFO database.’ Conference paper. Available at https://www.researchgate.net/profile/Harish-Kashyap-2/publication/315729875_A_Time-Series_Cluster_Space_Search_Scheme_for_Localization_of_Geospatial_Events_in_the_Unidentified_Flying_Objects_UFOs_Database/links/58dfbf39aca272059aae31f6/A-Time-Series-Cluster-Space-Search-Scheme-for-Localization-of-Geospatial-Events-in-the-Unidentified-Flying-Objects-UFOs-Database.pdf.
Laurent, Thibault, Christine Thomas-Agnan, and Michaël Vaillant. 2015. ‘Spatial point pattern analysis of the Unidentified Aerial Phenomena in France.’ arXiv preprint
1509.00571. Available at https://arxiv.org/abs/1509.00571.NUFORC. National UFO Reporting Center Database. http://www.nuforc.org/webreports.html.
R Core Team. 2020. R: A Language and Environment For Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. https://www.R-project.org/.
Renner, Timothy. 2017. ‘NUFORC Sighting Reports.’ Github repository. https://github.com/timothyrenner/nuforc_sightings_data/.
Salisbury, Frank B. 2010. The Utah UFO Display: A Scientist’s Report. Cedar Fort.
Walker, Kyle, Matt Herman, and Kris Eberwein. 2021.
tidycensus. R package. https://cran.r-project.org/web/packages/tidycensus/.U.S. Census Bureau. 2020. Gazatteer Files. Available at https://www.census.gov/geographies/reference-files/time-series/geo/gazetteer-files.html.
8 Appendix: Utah hot spots
Utah is often considered an area of concentrated UAP (Salisbury 2010). Because of this reputation, it is worth examining more closely. After accounting for area and population, the top ten Utah hot spots are:
| state | city | pop | aland_sqmi | density | n | pred | times |
|---|---|---|---|---|---|---|---|
| UT | Green River | 935 | 27.14 | 34.45 | 5 | 0.68 | 7.39 |
| UT | Salt Lake City | 200567 | 110.34 | 1811.83 | 151 | 23.87 | 6.33 |
| UT | Garden City | 617 | 8.78 | 70.45 | 5 | 1.00 | 5.02 |
| UT | Park City | 8526 | 20.42 | 426.59 | 13 | 3.55 | 3.66 |
| UT | Vernal | 10438 | 4.62 | 2258.33 | 11 | 4.38 | 2.51 |
| UT | Springdale | 629 | 4.62 | 136.29 | 3 | 1.25 | 2.39 |
| UT | Hurricane | 19074 | 52.76 | 361.55 | 10 | 4.37 | 2.29 |
| UT | Deweyville | 373 | 6.38 | 58.48 | 2 | 0.88 | 2.27 |
| UT | Moab | 5336 | 4.80 | 1112.63 | 7 | 3.26 | 2.15 |
| UT | Eureka | 707 | 1.48 | 476.42 | 3 | 1.43 | 2.10 |
Salt Lake City has over six times more observations than population and density would suggest. This is a substantial number, but is insufficient to make it a top hot spot.
The largest city in the Uintah Basin, Vernal, only reports 11 sightings. This is unusual because of the area’s reputation for reports of bizarre phenomena. However, this may not be as puzzling as it seems. As Salisbury noted, “Incidentally, the people of the Uintah Basin didn’t bother to report their sightings to the authorities,” instead reporting them to a local science teacher by the name of Junior Hicks (2010, p. 28).
9 Appendix: Residual analysis
The plots below show the fitted model model generally conforms to the standard assumptions of linear regression. There is no clear pattern when the predictions are plotted against the residuals, although the artifacts of only having positive numbers for \(y\) are observable. There is no apparent relationship between the order of the observations, and the residuals are approximately normal with a central tendency near zero. The Q-Q plot shows some strong outliers. These are the handful of American cities with truly exceptional populations, such a New York City and Los Angeles. These large cities account for almost all of the (relatively minor) systematic errors.
Residual diagnostic plots.
Plotting residuals by longitude and latitude does not show systematic errors.
Plotting the residuals against the independent variables do show increased errors for large cities, as mentioned above. The model estimates that cities like New York City or Los Angeles should have way more sightings than they are actually do. For the most part, this is not a concern for the present project.
Plotting residuals by area and density shows a slight pattern due to large cities as outliers.