This is a final project to show off what you have learned.
For this final project we will look at 94 Atlantic hurricanes from 1950 to 2012 and tie the official Saffir-Simpson Hurricane Wind Scale (SSHWS) categorization into the raw dataset. In contrast to the prior work referenced below, this analysis examines the relationship between scientifically categorized storm rankings and the relative death and destruction left in their wake.
One meaningful question for such an analysis is: Is there a linear relationship between a storm’s magnitude of destruction and its SSHWS category ranking?
The Atlantic hurricanes dataset comes from a Git repository of datasets at http://vincentarelbundock.github.io/Rdatasets/.
An extremely abbreviated literature review indicates that the dataset was first used by Jung et al. in their 2014 article in the Proceedings of the National Academy of Sciences (see References section below for full citation). These scientists posited that female-named hurricanes are deadlier than male-named hurricanes because culturally ingrained, name-related gender associations temper our natural responses to threats. They hypothesized that storms bearing feminine names would prompt less defensive preparation from the general population than storms bearing masculine names, leaving less prepared populations exposed to greater destruction and even death. As the article’s title indicates, they reported results that supported their hypothesis.
According to statistics gathered by the publisher, the article’s abstract has been used over 190,000 times since its original publication in 2014 (see http://pnas.org/content/111/24/8782/tab-article-info).
The presentation approach is up to you but it should contain the following: 1. Data Exploration: This should include summary statistics, means, medians, quartiles, or any other relevant information about the data set. Please include some conclusions in the R Markdown text.
So for our analysis, first let’s examine the structure of the Hurricanes dataset:
gitURL1 <- "https://raw.githubusercontent.com/douglasbarley/coursedata/master/hurricanes.csv"
hurricanes <- read.csv(gitURL1)
str(hurricanes)
## 'data.frame': 94 obs. of 13 variables:
## $ X : chr "Easy1950" "King1950" "Able1952" "Barbara1953" ...
## $ Name : chr "Easy" "King" "Able" "Barbara" ...
## $ Year : int 1950 1950 1952 1953 1953 1954 1954 1954 1955 1955 ...
## $ LF.WindsMPH : int 120 130 85 85 85 120 120 145 120 85 ...
## $ LF.PressureMB : int 958 955 985 987 985 960 954 938 962 987 ...
## $ LF.times : int 1 1 1 1 1 2 1 1 1 1 ...
## $ BaseDamage : num 3.3 28 2.75 1 0.2 ...
## $ NDAM2014 : int 1870 6030 170 65 18 21375 3520 28500 2270 17250 ...
## $ AffectedStates: chr "FL" "FL" "SC" "NC" ...
## $ firstLF : chr "9/4/1950" "10/17/1950" "8/30/1952" "8/13/1953" ...
## $ deaths : int 2 4 3 1 0 60 20 20 0 200 ...
## $ mf : chr "f" "m" "m" "f" ...
## $ BaseDam2014 : num 32.42 275.07 24.57 8.87 1.77 ...
Let’s also view summary statistics about the hurricanes dataset.
summary(hurricanes)
## X Name Year LF.WindsMPH
## Length:94 Length:94 Min. :1950 Min. : 75.0
## Class :character Class :character 1st Qu.:1964 1st Qu.: 85.0
## Mode :character Mode :character Median :1985 Median :105.0
## Mean :1982 Mean :104.7
## 3rd Qu.:1999 3rd Qu.:120.0
## Max. :2012 Max. :190.0
## LF.PressureMB LF.times BaseDamage NDAM2014
## Min. : 909.0 Min. :1.000 Min. : 0.20 Min. : 1
## 1st Qu.: 950.0 1st Qu.:1.000 1st Qu.: 25.75 1st Qu.: 290
## Median : 963.5 Median :1.000 Median : 200.00 Median : 2090
## Mean : 964.4 Mean :1.117 Mean : 3340.70 Mean : 8433
## 3rd Qu.: 982.8 3rd Qu.:1.000 3rd Qu.: 1500.00 3rd Qu.: 9050
## Max. :1003.0 Max. :3.000 Max. :81000.00 Max. :88420
## AffectedStates firstLF deaths mf
## Length:94 Length:94 Min. : 0.00 Length:94
## Class :character Class :character 1st Qu.: 2.00 Class :character
## Mode :character Mode :character Median : 5.00 Mode :character
## Mean : 44.17
## 3rd Qu.: 21.00
## Max. :1836.00
## BaseDam2014
## Min. : 1.04
## 1st Qu.: 93.11
## Median : 908.33
## Mean : 4830.19
## 3rd Qu.: 3341.62
## Max. :98195.39
We have the name and year of each storm, with some names repeated across different years. The wind speed in MPH at landfall (LF) ranges from a low of 75 MPH to a high of 190 MPH. Pressure in millibars at LF has a median of 963.5 and a mean of 964.4; a mean this close to the median suggests a roughly symmetric distribution of atmospheric pressure. The number of times a storm made landfall ranges from 1 to 3, with a mean of 1.117 indicating that most hurricanes make only one landfall. Donna, in 1960, is the major outlier, having made 3 separate landfalls in Florida, North Carolina and New York as she chewed her way up the east coast. Base damage in dollars as of the year of the storm ranges from 0.2 to 81000, and normalized property damage totals (i.e. valuations financially adjusted to 2014 dollars) range from $1M to $88,420M, with a mean of $8,433M but a median of only $2,090M. The number of deaths averages 44.17 with a median of 5 deaths. Katrina is the outlier, having caused 1,836 deaths.
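The gap between a mean and a median is a quick signal of skew. As a minimal sketch, the snippet below uses only the first ten NDAM2014 values visible in the str() output above, so the numbers illustrate the idea and do not describe the full dataset. The variable names are introduced here for illustration.

```r
# First ten NDAM2014 values, copied from the str() output above
ndam10 <- c(1870, 6030, 170, 65, 18, 21375, 3520, 28500, 2270, 17250)

mean_ndam   <- mean(ndam10)    # pulled upward by the few very costly storms
median_ndam <- median(ndam10)  # resistant to those extreme values

# A mean well above the median points to a right-skewed distribution
mean_ndam > median_ndam
```

The same comparison on the full hurricanes$NDAM2014 column corresponds to the mean of 8433 and median of 2090 reported by summary().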
We can also see that the category of each hurricane is apparently not included in the first dataset. Therefore, we will need to acquire the Saffir-Simpson Hurricane Wind Scale (SSHWS) category ranking in order to analyze each hurricane’s relative impacts by category. The SSHWS categorization consists of a scale of 1 to 5, with 5 being the most destructive, and the categories are based on measured wind speeds.
Fortunately the SSHWS is available from the National Hurricane Center (part of NOAA) at https://www.nhc.noaa.gov/aboutsshws.php. Let’s import this data from a csv file made from the website’s data and weave it into this analysis by merging the category field into the original Git dataset based on observed wind speeds at time of landfall.
Next let’s examine the structure of the Wind Speed/Category dataset:
gitURL2 <- "https://raw.githubusercontent.com/douglasbarley/coursedata/master/Saffir-SimpsonHurricaneWindScale.csv"
categories <- read.csv(gitURL2)
str(categories)
## 'data.frame': 15 obs. of 3 variables:
## $ Category : int 1 1 1 2 2 2 3 3 3 4 ...
## $ Sustained.Winds : chr "74-95 mph" "64-82 kt" "119-153 km/h" "96-110 mph" ...
## $ Types.of.Damage.Due.to.Hurricane.Winds: chr "Very dangerous winds will produce some damage: Well-constructed frame homes could have damage to roof, shingles"| __truncated__ "Very dangerous winds will produce some damage: Well-constructed frame homes could have damage to roof, shingles"| __truncated__ "Very dangerous winds will produce some damage: Well-constructed frame homes could have damage to roof, shingles"| __truncated__ "Extremely dangerous winds will cause extensive damage: Well-constructed frame homes could sustain major roof an"| __truncated__ ...
There is good text-based information describing the type of damage for each category. However, we only want each category number and its corresponding wind speeds in MPH (but not in Knots or KM/H). So let’s make a subset of the data that has only category and sustained.winds in MPH, and while we are subsetting the dataframe let’s also separate the Min and Max wind speeds into their own respective fields.
library(sqldf)
catwinds <- sqldf("select [Category],[Sustained.Winds],cast(substr([Sustained.Winds],0,case when instr([Sustained.Winds],'-') = 0 then instr([Sustained.Winds],' ') else instr([Sustained.Winds],'-') end) AS INT) AS MinSpeed, cast(case when instr([Sustained.Winds],'-') = 0 then 999 else substr([Sustained.Winds],instr([Sustained.Winds],'-')+1,instr([Sustained.Winds],' ')-instr([Sustained.Winds],'-')) end AS INT) AS MaxSpeed from categories where [Sustained.Winds] LIKE '%mph%'")
catwinds
## Category Sustained.Winds MinSpeed MaxSpeed
## 1 1 74-95 mph 74 95
## 2 2 96-110 mph 96 110
## 3 3 111-129 mph 111 129
## 4 4 130-156 mph 130 156
## 5 5 157 mph or higher 157 999
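For readers who prefer to avoid the SQL string functions, the same Min and Max split can be sketched in base R with a regular expression. This is an alternative illustration, not the method used in this report. The labels are copied from the output above, and using Inf for the open-ended top category (instead of the 999 sentinel) is an assumption made here.

```r
# Wind-speed labels copied from the catwinds output above
winds <- c("74-95 mph", "96-110 mph", "111-129 mph",
           "130-156 mph", "157 mph or higher")

# Extract every run of digits from each label (returns a list of character vectors)
nums <- regmatches(winds, gregexpr("[0-9]+", winds))

# The first number in each label is the lower bound
min_speed <- as.integer(sapply(nums, `[`, 1))

# The second number is the upper bound; the top category has none,
# so Inf is used as an assumed stand-in for the 999 sentinel
max_speed <- sapply(nums, function(x) if (length(x) >= 2) as.numeric(x[2]) else Inf)

data.frame(Category = 1:5, MinSpeed = min_speed, MaxSpeed = max_speed)
```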
With this new data we can assign a category to each storm in the hurricanes dataset by comparing its LF windspeed to the Min and Max windspeeds in the SSHWS category criteria.
for (i in 1:nrow(hurricanes)){
hurricanes$cat[i] <- catwinds$Category[hurricanes$LF.WindsMPH[i] >= catwinds$MinSpeed & hurricanes$LF.WindsMPH[i] <= catwinds$MaxSpeed]
}
# let's see what the above script did to the hurricanes table by looking at a subset of it
hurricanescat <- sqldf("select [Name],[Year],[LF.WindsMPH], [cat] from hurricanes")
hurricanescat
## Name Year LF.WindsMPH cat
## 1 Easy 1950 120 3
## 2 King 1950 130 4
## 3 Able 1952 85 1
## 4 Barbara 1953 85 1
## 5 Florence 1953 85 1
## 6 Carol 1954 120 3
## 7 Edna 1954 120 3
## 8 Hazel 1954 145 4
## 9 Connie 1955 120 3
## 10 Diane 1955 85 1
## 11 Ione 1955 120 3
## 12 Flossy 1956 105 2
## 13 Audrey 1957 145 4
## 14 Helene 1958 120 3
## 15 Debra 1959 85 1
## 16 Gracie 1959 120 3
## 17 Donna 1960 145 4
## 18 Ethel 1960 85 1
## 19 Carla 1961 145 4
## 20 Cindy 1963 85 1
## 21 Cleo 1964 105 2
## 22 Dora 1964 105 2
## 23 Hilda 1964 120 3
## 24 Isbell 1964 105 2
## 25 Betsy 1965 120 3
## 26 Alma 1966 105 2
## 27 Inez 1966 85 1
## 28 Beulah 1967 120 3
## 29 Gladys 1968 105 2
## 30 Camille 1969 190 5
## 31 Celia 1970 120 3
## 32 Fern 1971 85 1
## 33 Edith 1971 105 2
## 34 Ginger 1971 85 1
## 35 Agnes 1972 85 1
## 36 Carmen 1974 120 3
## 37 Eloise 1975 120 3
## 38 Belle 1976 85 1
## 39 Babe 1977 85 1
## 40 Bob 1979 85 1
## 41 David 1979 105 2
## 42 Frederic 1979 120 3
## 43 Allen 1980 115 3
## 44 Alicia 1983 115 3
## 45 Diana 1984 110 2
## 46 Bob 1985 75 1
## 47 Danny 1985 90 1
## 48 Elena 1985 115 3
## 49 Gloria 1985 120 3
## 50 Juan 1985 85 1
## 51 Kate 1985 100 2
## 52 Bonnie 1986 85 1
## 53 Charley 1986 75 1
## 54 Floyd 1987 75 1
## 55 Florence 1988 80 1
## 56 Chantal 1989 80 1
## 57 Hugo 1989 140 4
## 58 Jerry 1989 85 1
## 59 Bob 1991 105 2
## 60 Andrew 1992 170 5
## 61 Emily 1993 115 3
## 62 Erin 1995 100 2
## 63 Opal 1995 115 3
## 64 Bertha 1996 105 2
## 65 Fran 1996 115 3
## 66 Danny 1997 80 1
## 67 Bonnie 1998 110 2
## 68 Earl 1998 80 1
## 69 Georges 1998 105 2
## 70 Bret 1999 115 3
## 71 Floyd 1999 105 2
## 72 Irene 1999 80 1
## 73 Lili 2002 90 1
## 74 Claudette 2003 90 1
## 75 Isabel 2003 105 2
## 76 Alex 2004 80 1
## 77 Charley 2004 150 4
## 78 Gaston 2004 75 1
## 79 Frances 2004 105 2
## 80 Ivan 2004 120 3
## 81 Jeanne 2004 120 3
## 82 Cindy 2005 75 1
## 83 Dennis 2005 120 3
## 84 Katrina 2005 125 3
## 85 Ophelia 2005 75 1
## 86 Rita 2005 115 3
## 87 Wilma 2005 120 3
## 88 Humberto 2007 90 1
## 89 Dolly 2008 85 1
## 90 Gustav 2008 105 2
## 91 Ike 2008 110 2
## 92 Irene 2011 75 1
## 93 Isaac 2012 80 1
## 94 Sandy 2012 75 1
We now have an official category strength assigned to each hurricane based on its wind speed in MPH at landfall.
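The row-by-row loop can also be written as a single vectorized call. The sketch below uses base R's findInterval() with the lower bound of each category as a break point; it is shown on the first ten landfall wind speeds from the str() output above, and the names lf_winds, breaks and cat_fast are introduced here for illustration.

```r
# First ten landfall wind speeds (MPH), copied from the str() output above
lf_winds <- c(120, 130, 85, 85, 85, 120, 120, 145, 120, 85)

# Lower bound of each SSHWS category in MPH, plus an open upper end
breaks <- c(74, 96, 111, 130, 157, Inf)

# findInterval() returns i where breaks[i] <= x < breaks[i + 1],
# which is the category number; speeds below 74 MPH would return 0
cat_fast <- findInterval(lf_winds, breaks)
cat_fast
```

Because only lower bounds are used, a non-integer speed such as 95.5 MPH would still land in category 1, whereas the Min/Max comparison in the loop would match no category at all.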
A base histogram is helpful for seeing how many hurricanes occurred in each category.
hist(hurricanes$cat, main = "Hurricanes Histogram", xlab = "Hurricane Category")
It is interesting to note that there is a greater frequency of cat 3 storms than cat 2 storms in this dataset.
We can make a base boxplot with only one dimension, so let’s look at the normalized damage $ in a boxplot.
boxplot(hurricanes$NDAM2014)
Note that there are only 4 storms with devastating damage that are significantly skewing the data. Also note how low the median line falls within the Interquartile Range (IQR). There must be many low-damage storms below the line to pull the median that low, but it is hard to see in the base boxplot.
To make a base scatterplot we need data with at least two dimensions. Let’s look at the # of deaths versus normalized property damage $.
plot(deaths ~ NDAM2014, data = hurricanes)
We can see how Katrina has devastated the base scatterplot as the significant outlier in the # deaths it caused. Perhaps we can see better results using ggplot2?
Let’s revisit the histogram using ggplot2.
library(ggplot2)
ggplot(data = hurricanes) + geom_histogram(aes(x = cat))
This looks more polished than the base histogram, and the background grid makes it easier to see counts across the graph. But what if we wanted to know the probability of a storm being a certain category? We could use a density graph for that.
ggplot(data = hurricanes) + geom_density(aes(x = cat), fill = "grey50")
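This graph suggests that a cat 1 storm is the most likely, followed by a cat 3 and then a cat 2. Keep in mind that category is a discrete variable, so the smoothed density curve is only a rough guide to relative frequency. Still, if you hear that a strong storm is coming, it’s more probable that it will be a cat 3 than a cat 2. Good to know, right?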
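Since category takes only five values, the empirical share of storms in each category can also be computed directly. The sketch below assumes the category counts tallied by hand from the 94-row listing above (37, 20, 28, 7 and 2 for cat 1 through cat 5); with the data loaded, table(hurricanes$cat) would produce the counts without any hand tallying.

```r
# Category counts tallied by hand from the 94-row listing above (cat 1 to cat 5)
cat_vec <- rep(1:5, times = c(37, 20, 28, 7, 2))

cat_counts <- table(cat_vec)          # number of storms per category
cat_props  <- prop.table(cat_counts)  # empirical share of each category

round(cat_props, 3)
```

These proportions describe landfalling storms in this dataset only; they are not a forecast for any individual storm.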
Next how can we improve on our boxplot? Let’s try adding another dimension to it so we can see if that helps with the outliers:
ggplot(hurricanes, aes(y = NDAM2014, x = cat, group = cat)) + geom_boxplot()
That’s better! Here we can see that as the category of the storm increases, the median normalized property damage, along with the IQR around it, increases at what appears to be a non-linear rate; in fact it looks like it could be exponential. That begins to answer part of our meaningful question.
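The medians drawn in the boxplot can be computed numerically with tapply(). As a minimal sketch, the snippet below uses only the first ten storms (NDAM2014 from the str() output and cat from the listing above), so categories 2 and 5 are absent and the values are illustrative, not representative. The names ndam10, cat10 and med_by_cat are introduced here.

```r
# First ten storms: normalized damage and assigned category
ndam10 <- c(1870, 6030, 170, 65, 18, 21375, 3520, 28500, 2270, 17250)
cat10  <- c(3, 4, 1, 1, 1, 3, 3, 4, 3, 1)

# Median normalized damage within each category present in the sample
med_by_cat <- tapply(ndam10, cat10, median)
med_by_cat
```

On the full data, tapply(hurricanes$NDAM2014, hurricanes$cat, median) gives one median per category, and comparing the differences between successive medians is a simple way to judge whether the growth looks linear.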
But we cannot rest on our laurels and idly play our violin as Rome burns at this point, so let’s pick up our violin and see what kind of plot we can make with it.
ggplot(hurricanes, aes(y = NDAM2014, x = cat, group = cat)) + geom_point() + geom_violin()
This point/violin plot shows, once again, how Katrina is an outlying cat 3 storm having caused over $80B in damage. But it also includes visuals of the density of the data in the bubbles near the bottom of each category’s shape, which is pretty cool. For example, you can see the low but broad flattened density at the bottom of cat 1.
Last let’s see what features ggplot2 adds to scatterplot capabilities.
ggplot(hurricanes, aes(x = NDAM2014, y = deaths)) + geom_point(aes(color = cat)) + scale_x_continuous(trans = 'log10') + scale_y_continuous(trans = 'log10')
## Warning: Transformation introduced infinite values in continuous y-axis
Adding color to the data points and a legend helps to clarify the plot. The grid format makes it easier to see where points lie with respect to both axes, and transforming both axes to a log10 scale allows for a better visual distribution of the points in the scatterplot.
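The warning about infinite values comes from storms with zero deaths, because log10(0) is -Inf and those points cannot be placed on a log-scaled axis. The sketch below demonstrates this on the first ten deaths values from the str() output above. Adding 1 before taking the log is one common workaround, shown here as an assumption and not as something this analysis did.

```r
# First ten deaths values, copied from the str() output above
deaths10 <- c(2, 4, 3, 1, 0, 60, 20, 20, 0, 200)

log_deaths <- log10(deaths10)          # zeros become -Inf
n_dropped  <- sum(is.infinite(log_deaths))  # points lost on a log10 axis

# Assumed workaround: shift by 1 so that zero deaths maps to log10(1) = 0
log_deaths_shifted <- log10(deaths10 + 1)

n_dropped
```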
Please write a brief conclusion paragraph in R markdown at the end.
The original question was: Is there a linear relationship between a storm’s magnitude of destruction and its SSHWS category ranking? Using the base R visualization tools, it was difficult to see any relationship between a storm’s category and the destruction that it caused. In the ggplot2 boxplot of destruction by category, the damage caused by category 1, 2 and 3 hurricanes appears to grow roughly linearly, but as storms grow beyond cat 3 to cat 4 and 5, the magnitude of destruction appears to grow along an exponential curve. Therefore, the plots suggest that there is no simple linear relationship between the category of a storm and the destruction in its wake.
Jung, K., Shavitt, S., Viswanathan, M., & Hilbe, J. M. (2014). Female hurricanes are deadlier than male hurricanes. Proceedings of the National Academy of Sciences, 111(24), 8782-8787. DOI: 10.1073/pnas.1402786111