Population model of UAP sightings

Ben Horvath

June 2021

Abstract

Although the study of unidentified aeriel phenomena (UAP) has only recently captured the interest of the U.S. Congress, there is a long tradition of attempting to use data and the scientific method to explain it. Using large databases of observations, the natural first step is to map reported sightings to locate potential hot spots. However, maps of UAP observations tend to reproduce population distribution. That is, the biggest clusters coincide with the biggest populations. This is not informative. This paper uses simple linear regression to quantify the effect of population and land area on the number of reported UAP. The model errors are then interpreted as the number of sightings in excess of the pure population/area effect. This generates a list of true hot spots, which may suggest novel hypotheses to be tested or placements for electronic sensor systems.

Author e-mail: benhorvath@gmail.com
Github repository: benhorvath/nuforc_stats


Table of Contents

1 Introduction

The U.S. government has only recently began (or rather, restarted) a scientific investigation into unidentified aerial phenomena (UAP, formerly UFOs). However, there is a long tradition of attempting to study these sightings as scientifically as possible. These efforts range from official government programs, like the U.S. Air Force’s Project Blue Book, to civilian organizations like the Mutual UFO Network (MUFON) or the National UFO Reporting Center (NUFORC), to privately funded programs like the National Institute for Discovery Science (NIDS).

This work has attracted the attentions of numerous highly trained experts, most often engineers, physicists, and psychologists. Statisticians, however, are rather less visible (although see Laurent, et al., 2015, and Krishnamurthy, et al., 2017). This seems odd, as several large databases of UAP sightings are maintained by numerous groups, and at considerable cost. Applying statistical methods to these data sets is likely a fruitful investment.

This paper assembles a basic statistical model that exploits a well-known feature of UAP geography. Plotting sightings on a map is perhaps the most obvious initial investigation. However, it is also the most futile. Researchers quickly discover that they have mapped little more than population density. Compare the map of UAP sightings (left) with the Census population density map (right) below—the largest cities have the largest clusters, medium cities have medium clusters, and rural areas have zero or one or two sightings. This foils any attempt to learn something non-trivial, i.e., locating particular ‘hot zones’ or clusters that may suggest an explanation.

Right: Plot of UAP sightings from 2000--2020 (NUFORC). Left: Plot of population density (U.S. Census). The two maps display the variable: population.Right: Plot of UAP sightings from 2000--2020 (NUFORC). Left: Plot of population density (U.S. Census). The two maps display the variable: population.

Right: Plot of UAP sightings from 2000–2020 (NUFORC). Left: Plot of population density (U.S. Census). The two maps display the variable: population.

This paper attempts to get around this problem by literally removing population from the equation. Potential hot spots come into focus, suggesting novel hypotheses to be tested in the future, as well as locations best suited to hosting stationary electronic sensors. A city-level data set is collected, including the number of UAP sightings recorded in the NUFORC database, population, population density, and land area. A linear model is fit using least-squares regression. The model meets the assumptions required by linear regression, and appears to capture the data quite well. Large model errors are interpreted as indicating the presence of some effect above-and-beyond that of population. These can be considered UAP ‘hot spots,’ potential sites for further study.

The R code for this project is stored in the Github repository benhorvath/nuforc_stats. Contact the author at benhorvath@gmail.com to request the compiled data.

2 Observational capacity

The simplest model of a territory’s UAP observations abstracts from the particulars of time and geography. It assumes that every person in a given territory has an equal chance of observing UAP, every portion of land within the territory has an equal chance of ‘manifesting’ UAP, and that these probabilities do not change over time. It additionally assumes that rates of reporting do not vary across populations. Although these assumptions are probably wrong, they can be provisionally accepted for the purposes of this initial work.

Comparing the maps above, it is clear that population is a major factor. All else being equal, a territory with a larger population has more people observing the sky. That is, larger cities have more observational capacity, a greater probability of observing and reporting aerial phenomena.

However, the land area of a territory also must be taken into account. At some point, the marginal increase of additional population drops to zero. The additional population will be ‘covering’ a part of the sky that is already well-observed, probably by multiple people simultaneously. There is no benefit to the extra population. This ‘saturation point’ depends on the land area of the territory. The larger the territory, the more population it can be accomodated before saturation. Ultimately, the more area of sky that is ‘covered,’ the more UAP will be reported.

This suggests a basic mathematical representation of a territory \(i\)’s number of UAP sightings \(n\):

\[n_i = f(\mathrm{population}_i, ~ \mathrm{area}_i) + \epsilon\] where \(\epsilon\) is an error term, representing the effect of all other variables besides population and area.

The most basic form this function could take is linear and additive. These models generally represent the random variable of interest as \(y\), modeled as a combination of independent variables \(x_1, x_2, \ldots, x_p\) plus an error term:

\[y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p + \epsilon\]

where \(\beta_j\) represents a scalar controlling the relationship between an independent variable \(x_j\) and \(y\). Such a model is easily fit from empirical data via the least squares procedure, available in the statistical software R (R Core Team 2020). Given certain assumptions are met, statistical inferences about the fit parameters and error terms can be derived as well.

3 Data

This project uses three main data sources. First, all UAP sightings in the continental U.S. between 2000 and 2020 were scraped from the publicly available NUFORC Database (NUFORC, Renner 2017). This totals 72,298 unique observations. The standard database entry provides an account by the witness, the date and location of the sighting, the object’s shape, and the duration of sighting. The data set required some cleaning. Most notably, all boroughs of New York City, NY were reassigned to ‘New York City,’ and any records marked by NUFORC staff as obvious satellites have been removed (MADAR, etc.).

The U.S. Census Bureau maintains a record of the land and water area of almost all American cities (U.S. Census Bureau 2020). Although the the sightings occur over a twenty year period, areas are used as recorded in the 2020 Gazatteer files. Presumably, most cities’ area has not shifted substantially.

Finally, total population and population density were retrieved from the Census’s 2019 estimates. This data was pulled effortlessly via the R package tidycensus (Walker, et al., 2021). Again, for simplicity, it is assumed that city population and population density have not varied dramatically for most cities over the last two decades.

The combined data set contains nearly 58 thousand unique observations across more than 7700 cities, towns, and villages. This data covers 180 million Americans across all continental states, more than half of the country’s total population.

The plots below show the strong correlations between most of the variables under study. The linearity of each relationship strengthens considerably on a log scale. Note that the only variables not strongly correlated with each other are land area and population density. This becomes important in the next section.

Graphical summaries of the variables under study. Left shows the raw variables; right shows the natural log-transformed variables. The large decimals refer to the Pearson correlation between the variables, with asterisks marking statistical significance.Graphical summaries of the variables under study. Left shows the raw variables; right shows the natural log-transformed variables. The large decimals refer to the Pearson correlation between the variables, with asterisks marking statistical significance.

Graphical summaries of the variables under study. Left shows the raw variables; right shows the natural log-transformed variables. The large decimals refer to the Pearson correlation between the variables, with asterisks marking statistical significance.

4 Model

A linear regression model is fit on the natural logs of land area (square miles), population density (persons per square mile), and the interaction between the two. These two variables were chosen among those available because they were not highly correlated, as required by the assumptions of regression.

As always, the result do not indicate a causal relationship, only a statistical association. The full software output is below:

## 
## Call:
## lm(formula = log(n) ~ log(density) * log(aland_sqmi), data = dff)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.7188 -0.5013 -0.0192  0.4891  3.9866 
## 
## Coefficients:
##                               Estimate Std. Error t value            Pr(>|t|)
## (Intercept)                  -1.188593   0.104692  -11.35 <0.0000000000000002
## log(density)                  0.227042   0.014589   15.56 <0.0000000000000002
## log(aland_sqmi)              -0.507741   0.050374  -10.08 <0.0000000000000002
## log(density):log(aland_sqmi)  0.143005   0.006952   20.57 <0.0000000000000002
##                                 
## (Intercept)                  ***
## log(density)                 ***
## log(aland_sqmi)              ***
## log(density):log(aland_sqmi) ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7347 on 7700 degrees of freedom
## Multiple R-squared:  0.568,  Adjusted R-squared:  0.5678 
## F-statistic:  3374 on 3 and 7700 DF,  p-value: < 0.00000000000000022

All variables, including the interaction term, are highly significant. A comprehensive analysis of the model’s residuals indicates that the model meets the assumptions of linear regression (but see the appendix for more). Thus, a city’s \(i\)’s number of UAP sightings \(n\) over the period 2000–2020 (on average) can be estimated by the equation:

\[n_i = -1.1885 + 0.2270~log(density_i) - 0.5077~log(area_i) + 0.1430~log(density_i)~log(area_i)\]

Interpreting the fit parameters is complicated by the model’s structure. The interaction term necessarily implies that the association between density and the number of UAP observations varies according to the size of the land area, and vice versa. A graphical display communicates the model more succinctly:

Graphical display of the fit model.

Graphical display of the fit model.

The plot demonstrates how both land area and population density interact to ‘produce’ UAP observations. Each colored line represents a different level of population density. For example, a 40 square mile territory with a density of 500 people/square mile (red line) can expect 5 observations over the relevant time span. However, a territory with only 10 square miles (yellow line) can achieve the same number of observations with a population density of 1200 persons/square mile. That is, more people covering less area increases the expected number of UAP—up to the saturation point.

5 Results

This modeling exercise allows us to predict how many observations a city should have, given its size and population, on average. If a city has many more than the predicted value (large error), this suggests that something else is contributing to the quantity of sightings, over-and-above the effect of its observational capacity.

The top 25 potential hot spots suggested by the model are below. The column n refers to the actual number of UAP that were observed in that city from 2000–2020, and the column pred indicates how many were predicted by the model. The column times is n divided by pred, i.e., the relative size of the error.

state city pop aland_sqmi density n pred times
TX Mesquite 140937 0.05 2982.68 15 0.28 53.87
SC Pawleys Island 108 0.70 153.99 38 0.89 42.87
AZ Cottonwood 12253 0.14 733.66 22 0.58 37.68
AZ Gila Bend 2100 64.37 32.63 23 0.65 35.58
SC Myrtle Beach 34695 23.50 1472.54 245 8.65 28.31
SC North Myrtle Beach 16819 21.16 785.74 125 5.39 23.18
WA Harrington 418 0.39 1071.31 15 0.94 16.02
NC Ocean Isle Beach 665 3.74 179.08 21 1.35 15.58
GA Cumming 6547 7.19 978.83 55 3.73 14.75
MD Bel Air 10119 0.69 3332.16 22 1.50 14.63
AZ Sedona 10339 18.26 543.23 58 3.98 14.56
WV Hedgesville 297 0.13 2259.47 7 0.52 13.38
FL Naples 22088 12.30 1796.48 90 6.88 13.09
FL Bunnell 2943 139.78 21.25 5 0.43 11.63
MD Ocean City 6944 4.53 1532.61 39 3.65 10.69
NY Liverpool 2204 0.75 2911.62 16 1.55 10.29
CA Mountain View 82739 0.29 6900.35 9 0.88 10.21
PA Langhorne 1580 0.49 3211.54 12 1.20 9.97
AZ Buckeye 79620 392.98 202.54 45 4.58 9.83
WA Davenport 1744 1.67 1046.25 17 1.89 8.98
NC Oak Island 8386 19.27 449.66 32 3.60 8.89
FL Orlando 287442 110.62 2599.90 289 33.09 8.73
FL Vero Beach 17503 11.45 1529.12 51 6.02 8.48
NY Lake George 874 0.59 1491.37 10 1.20 8.32
WA Snohomish 10154 3.52 2883.57 34 4.12 8.26

For example, based on its population and land area, the town of Mesquite, TX should have about zero observations. Instead it has 15, more than 53 times the model prediction! This suggests that something beyond population and area is ‘producing’ UAP. (Specifically, UAP observations, rather than UAP itself.)

This list is much more helpful to researchers than the raw data, which suggests that the top 25 hot spots are the 25 largest cities. It is interesting to note a large concentration around the Myrtle Beach area, which also includes North Myrtle Beach and Ocean Isle Beach. Large Florida cities with a substantial tourist industry are also well-represented, including Naples and Orlando. Oak Island, NC is another tourist destination. The exceptional quantity of observations concentrated in coastal tourist towns suggest several hypotheses to account for this. Among other, these include that: 1. People in these towns spend a disproportionate amount of time idling on a beach, and thus have more time to observe the sky; or 2. Many people are consuming alcohol and seeing things that are not there. The American southwest is also present with excess observations in Texas and Arizona towns.

However, it must be remembered this is a simple approach using a simple model. This exercise is designed as a first step, not a final step. There is no guarantee that a city is truly exceptional because it appears on this list. One potential problem is that the NUFORC entries do not include an observer ID. It is possible that one location’s excessive sightings may only be a single resident is a UFO obsessive. Additionally, reporting rates probably vary by geography. Some populations may prefer to report to their sightings to a local expert rather than NUFORC (as in the Utah’s Uintah Basin region; see the appendix).

The full list of 7700 cities and towns is available as a CSV on the author’s Github: https://github.com/benhorvath/nuforc_stats/blob/main/results/top_hotspots.csv.

6 Conclusion

Mapping UAP observations from large sighting databases is an obvious first move, but it is stymied by the overwhelming effect of population. This paper developed a simple linear regression model to account for this effect. By modeling a city’s number of sightings as a function of land area and population density, it is possible to produce a superior list of UAP ‘hot spots.’ This list can be used to generate future hypotheses to test (e.g., the relationship between tourism and sightings). It can also be used to suggest placements for electronic sensor platforms (e.g., Skyhub). Building on this model could bring us closer to a scientific explanation for the UAP mystery.

7 References

8 Appendix: Utah hot spots

Utah is often considered an area of concentrated UAP (Salisbury 2010). Because of this reputation, it is worth examining more closely. After accounting for area and population, the top ten Utah hot spots are:

state city pop aland_sqmi density n pred times
UT Green River 935 27.14 34.45 5 0.68 7.39
UT Salt Lake City 200567 110.34 1811.83 151 23.87 6.33
UT Garden City 617 8.78 70.45 5 1.00 5.02
UT Park City 8526 20.42 426.59 13 3.55 3.66
UT Vernal 10438 4.62 2258.33 11 4.38 2.51
UT Springdale 629 4.62 136.29 3 1.25 2.39
UT Hurricane 19074 52.76 361.55 10 4.37 2.29
UT Deweyville 373 6.38 58.48 2 0.88 2.27
UT Moab 5336 4.80 1112.63 7 3.26 2.15
UT Eureka 707 1.48 476.42 3 1.43 2.10

Salt Lake City has over six times more observations than population and density would suggest. This is a substantial number, but is insufficient to make it a top hot spot.

The largest city in the Uintah Basin, Vernal, only reports 11 sightings. This is unusual because of the area’s reputation for reports of bizarre phenomena. However, this may not be as puzzling as it seems. As Salisbury noted, “Incidentally, the people of the Uintah Basin didn’t bother to report their sightings to the authorities,” instead reporting them to a local science teacher by the name of Junior Hicks (2010, p. 28).

9 Appendix: Residual analysis

The plots below show the fitted model model generally conforms to the standard assumptions of linear regression. There is no clear pattern when the predictions are plotted against the residuals, although the artifacts of only having positive numbers for \(y\) are observable. There is no apparent relationship between the order of the observations, and the residuals are approximately normal with a central tendency near zero. The Q-Q plot shows some strong outliers. These are the handful of American cities with truly exceptional populations, such a New York City and Los Angeles. These large cities account for almost all of the (relatively minor) systematic errors.

Residual diagnostic plots.

Residual diagnostic plots.

Plotting residuals by longitude and latitude does not show systematic errors.Plotting residuals by longitude and latitude does not show systematic errors.

Plotting residuals by longitude and latitude does not show systematic errors.

Plotting the residuals against the independent variables do show increased errors for large cities, as mentioned above. The model estimates that cities like New York City or Los Angeles should have way more sightings than they are actually do. For the most part, this is not a concern for the present project.

Plotting residuals by area and density shows a slight pattern due to large cities as outliers.Plotting residuals by area and density shows a slight pattern due to large cities as outliers.

Plotting residuals by area and density shows a slight pattern due to large cities as outliers.