Exploratory Data Analysis:

The aim of this project is to explore the International Best Track Archive for Climate Stewardship (IBTrACS) data set. Throughout the project, I address several claims about the hurricane data and use a variety of data visualization methods to support or refute them.

library(tidyverse)
library(lubridate)
#install.packages(c("maps", "rnaturalearth"))
library(maps)
library(rnaturalearth)
#install.packages("gganimate")
library(gganimate)

Before importing the CSV file, I created a vector for the column names that the assignment required us to import, from SID to LANDFALL. I then created another vector assigning the data types of each column.

col_names <- c('SID', 'SEASON', 'NUMBER', 'BASIN', 'SUBBASIN', 'NAME', 'ISO_TIME', 'NATURE', 'LAT', 'LON', 'WMO_WIND', 'WMO_PRES', 'WMO_AGENCY', 'TRACK_TYPE', 'DIST2LAND', 'LANDFALL')
col_class <- c('character', 'integer', 'integer', 'character', 'character', 'character', 'character', 'character', 'real', 'real', 'integer', 'integer', 'character', 'character', 'integer', 'integer')

Then I imported the CSV file, using the sample code as a template. The IBTrACS version 4 website states that missing values are encoded as blank cells, so I set na.strings to a single space. Overriding the default also keeps the North Atlantic basin code "NA" from being read as a missing value. I then renamed the columns using the col_names vector.

ibtracs <- read.table(
  file = 'ibtracs.NA.list.v04r00.csv',
  sep = ",",
  colClasses = c(col_class, rep("NULL", 147)),
  skip = 77876,
  na.strings = ' '
)
colnames(ibtracs) <- col_names
head(ibtracs, 5)
##             SID SEASON NUMBER BASIN SUBBASIN   NAME            ISO_TIME NATURE
## 1 1969326N10279   1969    113    NA       NA MARTHA 1969-11-25 12:00:00     TS
## 2 1970138N12281   1970     43    NA       CS   ALMA 1970-05-17 18:00:00     TS
## 3 1970138N12281   1970     43    NA       CS   ALMA 1970-05-17 21:00:00     TS
## 4 1970138N12281   1970     43    NA       CS   ALMA 1970-05-18 00:00:00     TS
## 5 1970138N12281   1970     43    NA       CS   ALMA 1970-05-18 03:00:00     TS
##      LAT      LON WMO_WIND WMO_PRES WMO_AGENCY TRACK_TYPE DIST2LAND LANDFALL
## 1  8.500 -82.0000       25       NA hurdat_atl       main         0       NA
## 2 11.500 -79.0000       25       NA hurdat_atl       main       224      224
## 3 11.575 -79.0676       NA       NA       <NA>       main       235      235
## 4 11.700 -79.2000       25       NA hurdat_atl       main       246      246
## 5 11.900 -79.4349       NA       NA       <NA>       main       268      268
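
The effect of na.strings on the basin code can be checked on a small scale. The sketch below uses made-up rows (not IBTrACS records) and the hypothetical names toy_lines, default_read, and blank_read. It assumes, as in the import above, that a missing cell holds a single space.

```r
# Made-up rows for illustration: ID, agency, basin code.
toy_lines <- c(
  "A1,hurdat_atl,NA",
  "A2, ,NA",
  "A3,hurdat_atl,EP"
)
classes <- rep("character", 3)

# Default na.strings = "NA": the basin code "NA" is read as a missing value.
default_read <- read.table(text = toy_lines, sep = ",", colClasses = classes)

# na.strings = " ": only the blank cell is missing; "NA" stays a string.
blank_read <- read.table(text = toy_lines, sep = ",", colClasses = classes,
                         na.strings = " ")

print(default_read$V3)
print(blank_read$V3)
```

With the default setting, the North Atlantic code would be indistinguishable from a missing value, and later filters such as BASIN == 'NA' would not behave as intended.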

I followed the given code to add the MONTH column and to display the structure of my imported data.

ibtracs$ISO_TIME = as.POSIXct(ibtracs$ISO_TIME)
ibtracs$MONTH <- lubridate::month(ibtracs$ISO_TIME)
str(ibtracs, vec.len = 1)
## 'data.frame':    46264 obs. of  17 variables:
##  $ SID       : chr  "1969326N10279" ...
##  $ SEASON    : int  1969 1970 ...
##  $ NUMBER    : int  113 43 ...
##  $ BASIN     : chr  "NA" ...
##  $ SUBBASIN  : chr  "NA" ...
##  $ NAME      : chr  "MARTHA" ...
##  $ ISO_TIME  : POSIXct, format: "1969-11-25 12:00:00" ...
##  $ NATURE    : chr  "TS" ...
##  $ LAT       : num  8.5 11.5 ...
##  $ LON       : num  -82 -79 ...
##  $ WMO_WIND  : int  25 25 ...
##  $ WMO_PRES  : int  NA NA ...
##  $ WMO_AGENCY: chr  "hurdat_atl" ...
##  $ TRACK_TYPE: chr  "main" ...
##  $ DIST2LAND : int  0 224 ...
##  $ LANDFALL  : int  NA 224 ...
##  $ MONTH     : num  11 5 ...

Throughout my exploration, I used dplyr and ggplot2 functions to address the claims.

2) The second claim stated that of the named storms in the 2020 Atlantic hurricane season, 7 of the hurricanes intensified into major hurricanes although none of them reached category 5 status.

I first filtered filter2020 (from the previous cell) to ensure that all the storms in atlantic_2020 were in the North Atlantic or South Atlantic BASINs, or in the Caribbean Sea or Gulf of Mexico SUBBASINs.

An important thing to note is that, according to the IBTrACS data dictionary, WMO_WIND is recorded in knots, not mph. Thus we must convert the 111 mph threshold for a Category 3 hurricane (the threshold that makes a hurricane a “major hurricane”) to ~96 knots. The same goes for Category 5 hurricanes: the threshold of 156 mph converts to ~135 knots.
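
The conversion itself is simple arithmetic. As a minimal sketch (the names MPH_PER_KNOT, mph_to_knots, thresholds_mph, and thresholds_kt are made up for this illustration), it can be written as:

```r
# Exact ratio: 1 nautical mile = 1852 m and 1 statute mile = 1609.344 m,
# so 1 knot is about 1.1508 mph.
MPH_PER_KNOT <- 1852 / 1609.344

# Convert a speed in miles per hour to knots.
mph_to_knots <- function(mph) mph / MPH_PER_KNOT

# Thresholds quoted in the text above, in mph.
thresholds_mph <- c(cat3 = 111, cat5 = 156)
thresholds_kt  <- mph_to_knots(thresholds_mph)

print(round(thresholds_kt, 1))
```

Both values land between whole knots (roughly 96.5 and 135.6), so the cutoffs of 96 and 135 used in the filters are the converted values with the fractional part dropped.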

atlantic_2020 = filter(filter2020, BASIN == 'NA' | BASIN == "SA" & SUBBASIN == 'CS' | SUBBASIN == 'GM')
cat3_or_above = filter(atlantic_2020, WMO_WIND %in% 96:134)

distinct(cat3_or_above, NAME)
##      NAME
## 1   LAURA
## 2   TEDDY
## 3   DELTA
## 4 EPSILON
## 5    ZETA
## 6     ETA
## 7    IOTA

I created a faceted scatterplot (scatter_2020_mjr) for the 7 major storms in 2020 to show the progression of the storms over the 2020 storm season.

mjr_WMO_tracked = filter(atlantic_2020, NAME %in% c('DELTA', 'EPSILON', 'ETA', 'IOTA', 'LAURA', 'TEDDY', 'ZETA'))

scatter_2020_mjr = ggplot(data = mjr_WMO_tracked, aes(x = ISO_TIME, y = WMO_WIND)) + 
  geom_point(size = .5) + 
  facet_wrap(~NAME) + 
  ggtitle("Wind Speed for the 7 Major Storms in 2020") + 
  theme_bw()

scatter_2020_mjr
## Warning: Removed 250 rows containing missing values (geom_point).

Lastly, to see whether any of the storms reached category 5 status, I filtered mjr_WMO_tracked for records above the threshold of 135 knots and found that not one hurricane crossed it.

filter(mjr_WMO_tracked, WMO_WIND > 135)
##  [1] SID        SEASON     NUMBER     BASIN      SUBBASIN   NAME      
##  [7] ISO_TIME   NATURE     LAT        LON        WMO_WIND   WMO_PRES  
## [13] WMO_AGENCY TRACK_TYPE DIST2LAND  LANDFALL   MONTH     
## <0 rows> (or 0-length row.names)

3) The third claim states that the 2010 Atlantic hurricane season had 19 named storms but that, despite this above-average activity, not one hurricane hit the United States. To check this claim, I first filtered ibtracs by SEASON and by the BASIN and SUBBASIN values that categorize a storm as an Atlantic hurricane.

hurricanes_2010 = filter(ibtracs, SEASON == 2010 & BASIN == 'NA' | BASIN == 'SA' & SUBBASIN == 'CS' | SUBBASIN == 'GM')
head(hurricanes_2010, 5)
##             SID SEASON NUMBER BASIN SUBBASIN NAME            ISO_TIME NATURE
## 1 1970138N12281   1970     43    NA       GM ALMA 1970-05-24 00:00:00     TS
## 2 1970138N12281   1970     43    NA       GM ALMA 1970-05-24 03:00:00     TS
## 3 1970138N12281   1970     43    NA       GM ALMA 1970-05-24 06:00:00     TS
## 4 1970138N12281   1970     43    NA       GM ALMA 1970-05-24 09:00:00     TS
## 5 1970138N12281   1970     43    NA       GM ALMA 1970-05-24 12:00:00     TS
##       LAT      LON WMO_WIND WMO_PRES WMO_AGENCY TRACK_TYPE DIST2LAND LANDFALL
## 1 23.0000 -84.0000       25       NA hurdat_atl       main        33       33
## 2 23.4925 -84.0000       NA       NA       <NA>       main        84       84
## 3 24.0000 -84.0000       25       NA hurdat_atl       main       137      137
## 4 24.5550 -84.0074       NA       NA       <NA>       main       199      199
## 5 25.2000 -84.0000       25     1008 hurdat_atl       main       242      196
##   MONTH
## 1     5
## 2     5
## 3     5
## 4     5
## 5     5
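
The first rows above are from 1970 rather than 2010. In R, `&` binds more tightly than `|`, so the ungrouped condition is read as three alternatives, and the last one (SUBBASIN == 'GM') admits Gulf of Mexico records from every season. The sketch below reproduces this on made-up rows (not IBTrACS records). The grouped version assumes the intent was "season 2010 and an Atlantic basin", which is my reading of the text rather than something the assignment confirms.

```r
library(dplyr)

# Made-up rows for illustration only.
toy <- data.frame(
  SEASON   = c(1970L, 2010L, 2010L, 2010L),
  BASIN    = c("NA", "NA", "EP", "NA"),
  SUBBASIN = c("GM", "CS", "MM", "NA"),
  stringsAsFactors = FALSE
)

# Ungrouped: parsed as (A & B) | (C & D) | E, so the 1970 Gulf row is kept.
loose <- filter(
  toy,
  SEASON == 2010 & BASIN == "NA" | BASIN == "SA" & SUBBASIN == "CS" | SUBBASIN == "GM"
)

# Grouped (assumed intent): comma-separated conditions are combined with AND.
strict <- filter(toy, SEASON == 2010, BASIN %in% c("NA", "SA"))

print(loose$SEASON)
print(strict$SEASON)
```

Separating conditions with commas, or wrapping each alternative in parentheses, makes the grouping explicit and keeps other seasons out of the result.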

Then I filtered hurricanes_2010 with the LAT and LON boundaries of the United States to see whether any of the storms passed through that region. I looked at a map of the United States with latitude and longitude lines to find the boundaries. For latitude, the top of the United States lies near 50°N and the bottom close to 30°N. However, to account for Florida and the southern tip of Texas, I lowered the southern boundary to 20°N. For the longitude boundaries, I applied the same logic, extending the range to cover Maine in the east and the West Coast in the west.

filter(hurricanes_2010, LAT %in% 20:50 & LON %in% 60:130) #barriers for LAT and LON
##  [1] SID        SEASON     NUMBER     BASIN      SUBBASIN   NAME      
##  [7] ISO_TIME   NATURE     LAT        LON        WMO_WIND   WMO_PRES  
## [13] WMO_AGENCY TRACK_TYPE DIST2LAND  LANDFALL   MONTH     
## <0 rows> (or 0-length row.names)
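
An empty result here does not by itself show that no storm entered the box. `%in% 20:50` matches only whole-number latitudes, and `%in% 60:130` can never match western longitudes, which are stored as negative values in this data set (see the LON column in the output above). A minimal sketch of a range-based filter on made-up coordinates, using dplyr's between():

```r
library(dplyr)

# Made-up positions for illustration only (LON is negative west of Greenwich).
toy <- data.frame(
  LAT = c(29.5, 30.0, 45.2, 10.0),
  LON = c(-90.1, -85.0, -70.3, -60.0)
)

# Set membership: tests for exact integer values, so nothing matches.
exact_match <- filter(toy, LAT %in% 20:50 & LON %in% 60:130)

# Range test: keeps every point inside the bounding box.
in_box <- filter(toy, between(LAT, 20, 50), between(LON, -130, -60))

print(nrow(exact_match))
print(nrow(in_box))
```

Even with the range test, a bounding box this large also covers open ocean and neighboring countries, so a match would only mark a candidate for closer inspection, not a confirmed landfall.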

I then made an animated map from hurricanes_2010 to see the course of the storms. From this I found that while the storms got pretty close to the coast from Texas to Louisiana in the Gulf of Mexico, they did not directly hit the United States!

world_df = ne_countries(scale = "medium", returnclass = "sf")
class(world_df)
## [1] "sf"         "data.frame"
worldcanvas = ggplot(data = world_df) + 
  geom_sf() + 
  coord_sf(xlim = c(-150, 0), ylim = c(0, 90), expand = TRUE) +
  theme_bw()

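# color is set outside aes() so the paths are drawn in red;
# group = SID keeps each storm's track as a separate path
map_storms = worldcanvas + 
  geom_path(data = hurricanes_2010,
            aes(x = LON, y = LAT, group = SID),
            color = 'red', lineend = "round", size = .2, alpha = 0.8)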

animated_2010 = worldcanvas + geom_point(data = hurricanes_2010, aes(x = LON, y = LAT), size = .3) + 
  transition_states(NAME, 
                    transition_length = 2,
                    state_length = 1) + 
  ggtitle("Storms in 2010")
animated_2010

anim_save("Storms in 2010.gif", animation = last_animation())

5) The fifth claim was that in the period from 1970 to 2020, the 2020 Atlantic hurricane season was the most active on record.

Here, the most “active” season is the one with the most tropical cyclones. I filtered ibtracs to the Atlantic BASINs and to records whose NATURE is tropical (TS) or subtropical (SS). I then counted the distinct storm names in each year (SEASON) and arranged the counts in descending order, from the most tropical cyclones to the fewest. The table shows that 2020 did in fact have the highest count, with 2005 a close second. Note that the table also lists 2021, which lies outside the 1970 to 2020 period of the claim, so 2020 is not the most recent season in this database. One takeaway is that the record count in 2020 is consistent with the expected effect of climate change on water temperature, although a count of storms by itself cannot establish that link.

atlantic_hurricanes = filter(ibtracs, BASIN == 'NA' | BASIN == 'SA' & NATURE == 'TS' | NATURE == "SS")
tropical_storm = count(
  distinct(
    group_by(
      select(atlantic_hurricanes, SEASON, NAME), SEASON
    )
  )
)
head(arrange(tropical_storm, desc(n)), 5)
## # A tibble: 5 × 2
## # Groups:   SEASON [5]
##   SEASON     n
##    <int> <int>
## 1   2020    31
## 2   2005    28
## 3   2021    22
## 4   1995    20
## 5   2010    20
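
As a hedged alternative to the nested calls, the same kind of count can be written as a pipeline that restricts the seasons to the period named in the claim and counts storms by SID. The data below are made up for illustration, and the names toy and storms_per_season are hypothetical.

```r
library(dplyr)

# Made-up records for illustration: several rows per storm, one SID per storm.
toy <- data.frame(
  SEASON = c(2019L, 2019L, 2019L, 2020L, 2020L, 2020L, 2020L, 2021L),
  SID    = c("a", "a", "b", "c", "d", "d", "f", "e"),
  stringsAsFactors = FALSE
)

storms_per_season <- toy %>%
  filter(between(SEASON, 1970, 2020)) %>%   # keep only the period in the claim
  group_by(SEASON) %>%
  summarise(n = n_distinct(SID)) %>%        # one count per storm, not per row
  arrange(desc(n))

print(storms_per_season)
```

Counting by SID also avoids lumping together any storms that share a placeholder name within a season, which is an assumption about how unnamed systems are labeled rather than something checked here.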

6) The final claim was that in the 2020 Atlantic hurricane season, 14 storms intensified into hurricanes, making it the season with the most hurricanes during the period 1970 to 2020.

To tackle this claim, I started from the atlantic_hurricanes data frame from the previous claim and added a filter for the wind speed threshold: a Category 1 hurricane is a storm with winds of at least 64 knots. I then counted the distinct storms by SID rather than NAME, because storm names are reused every 6 years, so over a time frame as long as 1970 to 2020 the same name can refer to different storms. After arranging the hurricane counts in descending order, I found that 2005 actually had the most storms intensifying into hurricanes, with 15. 2020 was a close second with 14 storms, so the count in the claim holds but the record does not!

atlantic_storms = filter(atlantic_hurricanes, WMO_WIND > 64)

oneorabove_atlantic = count(
  distinct(
    group_by(
      select(atlantic_storms, SID, SEASON), SEASON
    )
  )
)
head(arrange(oneorabove_atlantic, desc(n)), 5)
## # A tibble: 5 × 2
## # Groups:   SEASON [5]
##   SEASON     n
##    <int> <int>
## 1   2005    15
## 2   2020    14
## 3   2010    12
## 4   1995    11
## 5   1998    10
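
Another way to express "storms that intensified into hurricanes" is to take each storm's peak wind first and then apply the threshold. The sketch below does this on made-up rows. The names toy, peak_wind, and hurricanes_per_season are hypothetical, and the `>= 64` comparison follows the "at least 64 knots" wording in the text.

```r
library(dplyr)

# Made-up records for illustration: WMO_WIND in knots, NA where not reported.
toy <- data.frame(
  SID      = c("s1", "s1", "s2", "s2", "s3"),
  SEASON   = c(2005L, 2005L, 2005L, 2005L, 2020L),
  WMO_WIND = c(50L, 70L, 40L, NA, 64L),
  stringsAsFactors = FALSE
)

hurricanes_per_season <- toy %>%
  filter(!is.na(WMO_WIND)) %>%                  # drop records without a wind report
  group_by(SEASON, SID) %>%
  summarise(peak_wind = max(WMO_WIND), .groups = "drop") %>%
  filter(peak_wind >= 64) %>%                   # Category 1 or above
  count(SEASON) %>%                             # hurricanes per season
  arrange(desc(n))

print(hurricanes_per_season)
```

Here storm s2 never reaches the threshold and is excluded, while s3 peaks at exactly 64 knots and is counted.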

Animated Map:

In order to create an animated map of the storms in 2020, I first filtered ibtracs to only contain storms from the 2020 SEASON. I then followed along with Professor Sanchez’s guideline for graphing maps.

I chose to use the rnaturalearth package for its aesthetic advantage that “we can zoom-in without having distorted polygons”. Following along, I created a world data frame called world_df and then a world map called world_canvas. I chose theme_dark() rather than theme_bw() because I thought it looked better.

storms2020 = filter(ibtracs, SEASON == 2020)

world_df = ne_countries(scale = "medium", returnclass = "sf")
class(world_df)
## [1] "sf"         "data.frame"
world_canvas = ggplot(data = world_df) + 
  geom_sf() + 
  coord_sf(xlim = c(-150, 0), ylim = c(0, 90), expand = TRUE) + 
  theme_dark()

I then overlaid the storms2020 data on world_canvas, coloring the points by the NAME of the storm.

storms_imposed = world_canvas + 
  geom_point(data = storms2020, aes(x = LON, y = LAT, color = NAME)) 

Then, to animate it, I followed the gganimate guidelines by Pedersen and Robinson. By adjusting the aesthetic presets within gganimate, I created an animated world map of the storms in 2020!

# storms_imposed already contains the point layer, so only the transition is added here
animated_2020 = storms_imposed + 
  transition_states(NAME, 
                    transition_length = 2,
                    state_length = 1)

animated_2020 + ggtitle("Storms in 2020")

animated2020_names = animated_2020 + 
  enter_fade() + enter_drift(x_mod = -1) + 
  exit_shrink() + exit_drift(x_mod = 5)

map2020 = animate(
  animated2020_names + ease_aes(x = 'bounce-out') + enter_fly(x_loc = -1) + exit_fade(), 
  width = 400, height = 600, res = 35
)

anim_save("Storms in 2020.gif", animation = last_animation())
map2020
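
The animation above steps through the storms one NAME at a time. As a hedged alternative sketch, gganimate's transition_reveal() can draw each track progressively along a time variable instead. The tracks below are made up for illustration and the names toy_track and anim_by_time are hypothetical. The plot is only built here, not rendered.

```r
library(ggplot2)
library(gganimate)

# Made-up tracks for illustration: two storms, three positions each.
toy_track <- data.frame(
  NAME = rep(c("A", "B"), each = 3),
  ISO_TIME = as.POSIXct(
    c("2020-08-01 00:00:00", "2020-08-01 06:00:00", "2020-08-01 12:00:00",
      "2020-08-02 00:00:00", "2020-08-02 06:00:00", "2020-08-02 12:00:00"),
    tz = "UTC"
  ),
  LON = c(-60, -62, -64, -80, -82, -85),
  LAT = c(15, 16, 18, 20, 22, 25)
)

# group = NAME keeps each storm's path separate while it is revealed over time.
anim_by_time <- ggplot(toy_track, aes(x = LON, y = LAT, color = NAME, group = NAME)) +
  geom_path() +
  geom_point() +
  transition_reveal(ISO_TIME) +
  ggtitle("Toy storm tracks revealed over time")
```

Printing anim_by_time or passing it to animate() would render the frames. With the real data, the same layers could be added to world_canvas in place of the bare ggplot() call.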

Seeing 2020’s storms animated on a map really highlighted how many storms formed over the year, and how many of them intensified into hurricanes. From the map, you can also see that some of the storms hit the United States…

This stands in stark contrast to 2010, which tied with 1995 for the third-highest number of storms in the 1970 to 2020 period (after 2020 and 2005), yet none of its storms hit the United States. The higher number of storms may be linked to higher water temperatures, which fuel more storms to become hurricanes and can ultimately result in more hurricanes hitting land.

animated_2010