Some entries have no country listed. A quick check of not-available (NA) values shows that 9670 entries lack a country identifier, leaving 70662 entries that do have one. With that many NAs, it is worth looking at what they are. Filtering for those entries and scanning the head and structure of the result makes it clear that many entries lacking a country do have a state listed. Some of those “states” are Canadian provinces, but the majority are US states that simply lack a country identifier. I will continue to leave them out of this analysis (to avoid conflating Canada and the United States). Filtering further for entries with neither state nor country leaves 3256 rows. Reading through this list, it is evident that many of these are from countries around the world, with the country name given in parentheses in the city column. For future analysis, it may be useful to pull the country name out of the city column and add it to the country column.
Analyzing the NA entries
missingcountry <- newufo %>% filter(is.na(country))
head(missingcountry)
## # A tibble: 6 x 13
## datetime city state country shape `duration (seco…
## <dttm> <chr> <chr> <chr> <chr> <dbl>
## 1 1949-10-10 21:00:00 lack… tx <NA> light 7200
## 2 1973-10-10 23:00:00 berm… <NA> <NA> light 20
## 3 1979-10-10 22:00:00 sadd… ab <NA> tria… 270
## 4 1982-10-10 07:00:00 gisb… <NA> <NA> disk 120
## 5 1986-10-10 20:00:00 holm… ny <NA> chev… 180
## 6 1989-10-10 21:00:00 kran… ky <NA> tria… 180
## # … with 7 more variables: `duration (hours/min)` <chr>, comments <chr>, `date
## # posted` <chr>, latitude <dbl>, longitude <dbl>, year <dbl>, month <ord>
str(missingcountry)
## tibble [9,670 × 13] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ datetime : POSIXct[1:9670], format: "1949-10-10 21:00:00" "1973-10-10 23:00:00" ...
## $ city : chr [1:9670] "lackland afb" "bermuda nas" "saddle lake (canada)" "gisborne (new zealand)" ...
## $ state : chr [1:9670] "tx" NA "ab" NA ...
## $ country : chr [1:9670] NA NA NA NA ...
## $ shape : chr [1:9670] "light" "light" "triangle" "disk" ...
## $ duration (seconds) : num [1:9670] 7200 20 270 120 180 180 1200 3600 300 60 ...
## $ duration (hours/min): chr [1:9670] "1-2 hrs" "20 sec." "4.5 or more min." "2min" ...
## $ comments : chr [1:9670] "1949 Lackland AFB, TX. Lights racing across the sky & making 90 degree turns on a dime." "saw fast moving blip on the radar scope thin went outside and saw it again." "Lights far above, that glance; then flee from the celestrialhavens, only to appear again." "gisborne nz 1982 wainui beach to sponge bay" ...
## $ date posted : chr [1:9670] "12/16/2005" "1/11/2002" "1/19/2005" "1/11/2002" ...
## $ latitude : num [1:9670] 29.4 32.4 54 -38.7 41.5 ...
## $ longitude : num [1:9670] -98.6 -64.7 -111.7 178 -73.6 ...
## $ year : num [1:9670] 1949 1973 1979 1982 1986 ...
## $ month : Ord.factor w/ 12 levels "Jan"<"Feb"<"Mar"<..: 10 10 10 10 10 10 10 10 10 10 ...
## - attr(*, "problems")= tibble [4 × 5] (S3: tbl_df/tbl/data.frame)
## ..$ row : int [1:4] 27823 35693 43783 58592
## ..$ col : chr [1:4] "duration (seconds)" "duration (seconds)" "latitude" "duration (seconds)"
## ..$ expected: chr [1:4] "no trailing characters" "no trailing characters" "no trailing characters" "no trailing characters"
## ..$ actual : chr [1:4] "`" "`" "q.200088" "`"
## ..$ file : chr [1:4] "'scrubbed.csv'" "'scrubbed.csv'" "'scrubbed.csv'" "'scrubbed.csv'"
## - attr(*, "spec")=
## .. cols(
## .. datetime = col_character(),
## .. city = col_character(),
## .. state = col_character(),
## .. country = col_character(),
## .. shape = col_character(),
## .. `duration (seconds)` = col_double(),
## .. `duration (hours/min)` = col_character(),
## .. comments = col_character(),
## .. `date posted` = col_character(),
## .. latitude = col_double(),
## .. longitude = col_double()
## .. )
missingcountry_notUS <- missingcountry %>% filter(is.na(state))
head(missingcountry_notUS)
## # A tibble: 6 x 13
## datetime city state country shape `duration (seco…
## <dttm> <chr> <chr> <chr> <chr> <dbl>
## 1 1973-10-10 23:00:00 berm… <NA> <NA> light 20
## 2 1982-10-10 07:00:00 gisb… <NA> <NA> disk 120
## 3 1993-10-10 03:00:00 zlat… <NA> <NA> sphe… 1200
## 4 1996-10-10 20:00:00 lake… <NA> <NA> light 300
## 5 2003-10-10 23:00:00 bick… <NA> <NA> unkn… 2700
## 6 2004-10-10 15:20:00 keda… <NA> <NA> oval 240
## # … with 7 more variables: `duration (hours/min)` <chr>, comments <chr>, `date
## # posted` <chr>, latitude <dbl>, longitude <dbl>, year <dbl>, month <ord>
str(missingcountry_notUS)
## tibble [3,256 × 13] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ datetime : POSIXct[1:3256], format: "1973-10-10 23:00:00" "1982-10-10 07:00:00" ...
## $ city : chr [1:3256] "bermuda nas" "gisborne (new zealand)" "zlatoust (russia)" "lake macquarie (nsw, australia)" ...
## $ state : chr [1:3256] NA NA NA NA ...
## $ country : chr [1:3256] NA NA NA NA ...
## $ shape : chr [1:3256] "light" "disk" "sphere" "light" ...
## $ duration (seconds) : num [1:3256] 20 120 1200 300 2700 240 600 300 1200 3600 ...
## $ duration (hours/min): chr [1:3256] "20 sec." "2min" "20 minutes" "5 min" ...
## $ comments : chr [1:3256] "saw fast moving blip on the radar scope thin went outside and saw it again." "gisborne nz 1982 wainui beach to sponge bay" "I woke up at night and looked out the window near my bed. There was a huge sphere of shining light in front of "| __truncated__ "RED LIGHT WITH OTHER RED FLASHING LIGHT, ONE OBJECT" ...
## $ date posted : chr [1:3256] "1/11/2002" "1/11/2002" "12/14/2004" "5/24/1999" ...
## $ latitude : num [1:3256] 32.4 -38.7 55.2 -33.1 53.1 ...
## $ longitude : num [1:3256] -64.68 178.02 59.65 151.59 -2.74 ...
## $ year : num [1:3256] 1973 1982 1993 1996 2003 ...
## $ month : Ord.factor w/ 12 levels "Jan"<"Feb"<"Mar"<..: 10 10 10 10 10 10 10 10 10 10 ...
## - attr(*, "problems")= tibble [4 × 5] (S3: tbl_df/tbl/data.frame)
## ..$ row : int [1:4] 27823 35693 43783 58592
## ..$ col : chr [1:4] "duration (seconds)" "duration (seconds)" "latitude" "duration (seconds)"
## ..$ expected: chr [1:4] "no trailing characters" "no trailing characters" "no trailing characters" "no trailing characters"
## ..$ actual : chr [1:4] "`" "`" "q.200088" "`"
## ..$ file : chr [1:4] "'scrubbed.csv'" "'scrubbed.csv'" "'scrubbed.csv'" "'scrubbed.csv'"
## - attr(*, "spec")=
## .. cols(
## .. datetime = col_character(),
## .. city = col_character(),
## .. state = col_character(),
## .. country = col_character(),
## .. shape = col_character(),
## .. `duration (seconds)` = col_double(),
## .. `duration (hours/min)` = col_character(),
## .. comments = col_character(),
## .. `date posted` = col_character(),
## .. latitude = col_double(),
## .. longitude = col_double()
## .. )
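The idea of pulling the country name out of the city column could be sketched roughly as follows. This is only an illustration on a few toy rows modeled on the city values shown above. The names `example`, `paren`, and `country_guess` are made up for this sketch, and it assumes the country is the last comma-separated item inside a trailing pair of parentheses, which will not hold for every row.

```r
library(dplyr)
library(stringr)

# Toy rows modeled on the city values printed above (not the full data set)
example <- tibble(
  city = c("bermuda nas",
           "gisborne (new zealand)",
           "zlatoust (russia)",
           "lake macquarie (nsw, australia)"),
  country = NA_character_
)

example_filled <- example %>%
  mutate(
    # Text inside a trailing pair of parentheses, e.g. "nsw, australia";
    # NA when the city has no parentheses
    paren = str_match(city, "\\(([^()]*)\\)\\s*$")[, 2],
    # Assumption: the country is whatever follows the last comma, if any
    country_guess = str_trim(str_remove(paren, "^.*,")),
    # Fill the country only where it is currently missing
    country = coalesce(country, country_guess)
  ) %>%
  select(-paren)

example_filled
```

Note that the existing country column uses two-letter codes such as "de" and "au", so full names recovered this way would still need to be mapped to codes before the two could be combined.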
Among sightings with a valid country field, five countries are represented. The US is by far the most common source of reports.
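A tally like this can be produced with `count()`. The sketch below uses a small made-up vector of country codes (`toy_sightings` is hypothetical, not the real data) and assumes that rows with a missing country are dropped before counting.

```r
library(dplyr)

# Made-up country codes for illustration only
toy_sightings <- tibble(
  country = c("us", "us", "ca", NA, "gb", "us", "au", "de", NA, "us")
)

country_counts <- toy_sightings %>%
  filter(!is.na(country)) %>%   # keep only rows with a country
  count(country, sort = TRUE)   # one row per country, largest count first

country_counts
```

Running the same two steps on the full data would give the per-country totals.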
Why Germany?
Interestingly, all but 105 of the sightings with identified countries were reported in English-speaking countries. It is worth exploring the reports from Germany to see whether anything explains why it is the only non-English-speaking country in this data.
Code pulling out the Germany sightings from the data
germanysightings <- countrysightings %>% filter(country == "de")
head(germanysightings)
## # A tibble: 6 x 13
## datetime city state country shape `duration (seco…
## <dttm> <chr> <chr> <chr> <chr> <dbl>
## 1 2006-10-13 00:02:00 berl… <NA> de fire… 120
## 2 2012-10-20 18:00:00 berl… <NA> de unkn… 1500
## 3 2012-10-08 17:10:00 ober… <NA> de tria… 2
## 4 2011-01-10 18:38:00 otte… <NA> de tria… 240
## 5 1990-11-15 22:30:00 brem… <NA> de unkn… 30
## 6 2005-11-15 15:00:00 semb… <NA> de egg 120
## # … with 7 more variables: `duration (hours/min)` <chr>, comments <chr>, `date
## # posted` <chr>, latitude <dbl>, longitude <dbl>, year <dbl>, month <ord>
tail(germanysightings)
## # A tibble: 6 x 13
## datetime city state country shape `duration (seco…
## <dttm> <chr> <chr> <chr> <chr> <dbl>
## 1 2009-09-12 19:00:00 graf… <NA> de diam… 60
## 2 2011-09-13 12:00:00 heil… <NA> de sphe… 5
## 3 2007-09-16 08:15:00 gels… <NA> de light 20
## 4 2007-09-16 18:15:00 neck… <NA> de other 30
## 5 2011-09-04 05:00:00 mann… <NA> de light 1800
## 6 2009-09-09 21:38:00 kais… <NA> de light 40
## # … with 7 more variables: `duration (hours/min)` <chr>, comments <chr>, `date
## # posted` <chr>, latitude <dbl>, longitude <dbl>, year <dbl>, month <ord>
No pattern is immediately apparent in the Germany sightings. Some of the sightings are at locations of US military bases, which may explain some of the reports. At least one of the comments, however, is written in French, suggesting it was not written by an American service member.
Examining the year column shows whether the reports cluster in a particular time period.
Code counting the sightings (n) per year in Germany
germanysightings %>% count(year)
## # A tibble: 35 x 2
## year n
## <dbl> <int>
## 1 1962 1
## 2 1968 2
## 3 1969 2
## 4 1970 1
## 5 1971 1
## 6 1973 1
## 7 1974 1
## 8 1975 1
## 9 1979 1
## 10 1981 1
## # … with 25 more rows
ggplot(data = germanysightings) + geom_bar(mapping = aes(x = year))

This bar chart and table show that the sightings were fairly evenly spread out (1 or 2 every couple of years) until the 2000s, when the numbers increased, reaching an anomalous high of 15 in 2008. The increase in the 2000s tracks the overall increase in reported UFO sightings worldwide in the late 1990s and 2000s seen earlier.
Difference between Northern and Southern Hemispheres?
The overall data showed that the majority of sightings took place between June and November, which is summer and autumn in the northern hemisphere. Does this hold true for the sightings reported from Australia, in the southern hemisphere?
Code pulling out sightings per month in Australia
australiasightings <- countrysightings %>% filter(country == "au")
australiasightings %>% count(month)
## # A tibble: 12 x 2
## month n
## <ord> <int>
## 1 Jan 60
## 2 Feb 32
## 3 Mar 50
## 4 Apr 53
## 5 May 47
## 6 Jun 66
## 7 Jul 49
## 8 Aug 37
## 9 Sep 29
## 10 Oct 28
## 11 Nov 43
## 12 Dec 44
ggplot(data = australiasightings) + geom_bar(mapping = aes(x = month))

Calculating the mean and the median number of sightings per month shows which months are above the mean and which are above the median.
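One way to do this is sketched below. It re-enters the monthly counts from the `count(month)` output above so that it runs on its own. The names `australia_monthly`, `monthly_summary`, and `above_typical` are made up for this sketch.

```r
library(dplyr)

# Monthly counts copied from the count(month) output above
australia_monthly <- tibble(
  month = factor(month.abb, levels = month.abb, ordered = TRUE),
  n = c(60, 32, 50, 53, 47, 66, 49, 37, 29, 28, 43, 44)
)

# Mean and median number of sightings per month
monthly_summary <- australia_monthly %>%
  summarise(mean_n = mean(n), median_n = median(n))

# Months whose count exceeds the mean or the median
above_typical <- australia_monthly %>%
  mutate(above_mean = n > mean(n),
         above_median = n > median(n)) %>%
  filter(above_mean | above_median)

monthly_summary
above_typical
```

With these counts the mean is about 44.8 and the median is 45.5, and the months above both are January and March through July.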