R Markdown

R Bridge Course Final Project

This is a final project to show off what you have learned.

Question for Analysis

  1. Meaningful question for analysis: Please state at the beginning a meaningful question for analysis. Use the first three steps and anything else that would be helpful to answer the question you are posing from the data set you chose. Please write a brief conclusion paragraph in R markdown at the end.

For this final project we will look at 94 Atlantic hurricanes from 1950 to 2012, and we will tie in official Saffir-Simpson Hurricane Wind Scale (SSHWS) categorization to the raw dataset. In contrast to prior studies referenced below, this analysis will examine the relationship between scientifically categorized storm rankings and the relative death and destruction left in their wake.

One meaningful question for such an analysis is:

Is there a linear relationship between a storm’s magnitude of destruction and its SSHWS category ranking?

  1. BONUS – place the original .csv in a github file and have R read from the link. This will be a very useful skill as you progress in your data science education and career.

Sample Dataset: Atlantic Hurricanes

The Atlantic hurricanes dataset comes from a Git repository of datasets at http://vincentarelbundock.github.io/Rdatasets/.

An extremely abbreviated literature review indicates that it was first used by Jung et al. in 2014 in their article for the Proceedings of the National Academy of Sciences (see References section below for full citation). These scientists posited that female-named hurricanes are deadlier than male-named hurricanes due to culturally ingrained psychological biases that temper our natural responses to threats based on name-related gender associations. They hypothesized that storms bearing feminine names would prompt less defensive postures from the general population than storms bearing masculine names, thereby leading to greater destruction and even death among less prepared populations. As the article’s title indicates, they reported results that supported their hypothesis.

According to statistics gathered by the publishing entity, the article’s abstract has been used over 190,000 times since its original publication in 2014 (see http://pnas.org/content/111/24/8782/tab-article-info).

Data Exploration

The presentation approach is up to you but it should contain the following: 1. Data Exploration: This should include summary statistics, means, medians, quartiles, or any other relevant information about the data set. Please include some conclusions in the R Markdown text.

So for our analysis, first let’s examine the structure of the Hurricanes dataset:

gitURL1 <- "https://raw.githubusercontent.com/douglasbarley/coursedata/master/hurricanes.csv"

hurricanes <- read.csv(gitURL1)

str(hurricanes)
## 'data.frame':    94 obs. of  13 variables:
##  $ X             : chr  "Easy1950" "King1950" "Able1952" "Barbara1953" ...
##  $ Name          : chr  "Easy" "King" "Able" "Barbara" ...
##  $ Year          : int  1950 1950 1952 1953 1953 1954 1954 1954 1955 1955 ...
##  $ LF.WindsMPH   : int  120 130 85 85 85 120 120 145 120 85 ...
##  $ LF.PressureMB : int  958 955 985 987 985 960 954 938 962 987 ...
##  $ LF.times      : int  1 1 1 1 1 2 1 1 1 1 ...
##  $ BaseDamage    : num  3.3 28 2.75 1 0.2 ...
##  $ NDAM2014      : int  1870 6030 170 65 18 21375 3520 28500 2270 17250 ...
##  $ AffectedStates: chr  "FL" "FL" "SC" "NC" ...
##  $ firstLF       : chr  "9/4/1950" "10/17/1950" "8/30/1952" "8/13/1953" ...
##  $ deaths        : int  2 4 3 1 0 60 20 20 0 200 ...
##  $ mf            : chr  "f" "m" "m" "f" ...
##  $ BaseDam2014   : num  32.42 275.07 24.57 8.87 1.77 ...

Let’s also view summary statistics about the hurricanes dataset.

summary(hurricanes)
##       X                 Name                Year       LF.WindsMPH   
##  Length:94          Length:94          Min.   :1950   Min.   : 75.0  
##  Class :character   Class :character   1st Qu.:1964   1st Qu.: 85.0  
##  Mode  :character   Mode  :character   Median :1985   Median :105.0  
##                                        Mean   :1982   Mean   :104.7  
##                                        3rd Qu.:1999   3rd Qu.:120.0  
##                                        Max.   :2012   Max.   :190.0  
##  LF.PressureMB       LF.times       BaseDamage          NDAM2014    
##  Min.   : 909.0   Min.   :1.000   Min.   :    0.20   Min.   :    1  
##  1st Qu.: 950.0   1st Qu.:1.000   1st Qu.:   25.75   1st Qu.:  290  
##  Median : 963.5   Median :1.000   Median :  200.00   Median : 2090  
##  Mean   : 964.4   Mean   :1.117   Mean   : 3340.70   Mean   : 8433  
##  3rd Qu.: 982.8   3rd Qu.:1.000   3rd Qu.: 1500.00   3rd Qu.: 9050  
##  Max.   :1003.0   Max.   :3.000   Max.   :81000.00   Max.   :88420  
##  AffectedStates       firstLF              deaths             mf           
##  Length:94          Length:94          Min.   :   0.00   Length:94         
##  Class :character   Class :character   1st Qu.:   2.00   Class :character  
##  Mode  :character   Mode  :character   Median :   5.00   Mode  :character  
##                                        Mean   :  44.17                     
##                                        3rd Qu.:  21.00                     
##                                        Max.   :1836.00                     
##   BaseDam2014      
##  Min.   :    1.04  
##  1st Qu.:   93.11  
##  Median :  908.33  
##  Mean   : 4830.19  
##  3rd Qu.: 3341.62  
##  Max.   :98195.39

We have the name and year of each storm, with some names repeated across different years. The wind speed at landfall (LF) ranges from a low of 75 MPH to a high of 190 MPH. Pressure in millibars at LF has a median of 963.5 and a mean of 964.4, suggesting a roughly symmetric distribution of atmospheric pressure. The number of times a storm made landfall ranges from 1 to 3, with a mean of 1.117 indicating that most hurricanes make only one landfall. Donna, in 1960, is the major outlier, having made 3 separate landfalls in Florida, North Carolina and New York as she chewed her way up the east coast.

Base damage, valued as of the year of the storm, ranges from 0.2 to 81000 (presumably in millions of dollars, like the normalized figures). Normalized property damage (i.e. valuations financially adjusted to 2014 dollars) ranges from $1M to $88,420M, with a mean of $8,433M but a median of only $2,090M. The number of deaths averages 44.17 with a median of 5. In both cases the mean sits far above the median, which indicates a strongly right-skewed distribution driven by a few extreme storms. Katrina is the outlier for deaths, having caused 1,836.
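As a rough check on the skew in the damage figures, the mean and median can be compared directly. The sketch below uses only the first ten NDAM2014 values visible in the str() output, so its numbers differ from the full-dataset summary; the name ndam_first10 is introduced here purely for illustration.

```r
# First ten NDAM2014 values, copied from the str() output above (not the full dataset)
ndam_first10 <- c(1870, 6030, 170, 65, 18, 21375, 3520, 28500, 2270, 17250)

ndam_mean   <- mean(ndam_first10)
ndam_median <- median(ndam_first10)

# A mean well above the median points to a right-skewed distribution
skew_ratio <- ndam_mean / ndam_median
print(c(mean = ndam_mean, median = ndam_median, ratio = skew_ratio))
```

With the full data frame loaded, the same comparison would use hurricanes$NDAM2014 in place of the ten-value vector.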

Data Wrangling

  1. Data wrangling: Please perform some basic transformations. They will need to make sense but could include column renaming, creating a subset of the data, replacing values, or creating new columns with derived data (for example – if it makes sense you could sum two columns together)

We can also see that the category of each hurricane is apparently not included in the first dataset. Therefore, we will need to acquire the Saffir-Simpson Hurricane Wind Scale (SSHWS) category ranking in order to analyze each hurricane’s relative impacts by category. The SSHWS categorization consists of a scale of 1 to 5, with 5 being the most destructive, and the categories are based on measured wind speeds.

Fortunately the SSHWS is available from the National Hurricane Center (part of NOAA) at https://www.nhc.noaa.gov/aboutsshws.php. Let’s import this data from a csv file made from the website’s data and weave it into this analysis by merging the category field into the original Git dataset based on observed wind speeds at time of landfall.

Next let’s examine the structure of the Wind Speed/Category dataset:

gitURL2 <- "https://raw.githubusercontent.com/douglasbarley/coursedata/master/Saffir-SimpsonHurricaneWindScale.csv"

categories <- read.csv(gitURL2)

str(categories)
## 'data.frame':    15 obs. of  3 variables:
##  $ Category                              : int  1 1 1 2 2 2 3 3 3 4 ...
##  $ Sustained.Winds                       : chr  "74-95 mph" "64-82 kt" "119-153 km/h" "96-110 mph" ...
##  $ Types.of.Damage.Due.to.Hurricane.Winds: chr  "Very dangerous winds will produce some damage: Well-constructed frame homes could have damage to roof, shingles"| __truncated__ "Very dangerous winds will produce some damage: Well-constructed frame homes could have damage to roof, shingles"| __truncated__ "Very dangerous winds will produce some damage: Well-constructed frame homes could have damage to roof, shingles"| __truncated__ "Extremely dangerous winds will cause extensive damage: Well-constructed frame homes could sustain major roof an"| __truncated__ ...

There is good text-based information describing the type of damage for each category. However, we only want each category number and its corresponding wind speeds in MPH (but not in Knots or KM/H). So let’s make a subset of the data that has only category and sustained.winds in MPH, and while we are subsetting the dataframe let’s also separate the Min and Max wind speeds into their own respective fields.

library(sqldf)

catwinds <- sqldf("
  select [Category],
         [Sustained.Winds],
         cast(substr([Sustained.Winds], 0,
                     case when instr([Sustained.Winds],'-') = 0
                          then instr([Sustained.Winds],' ')
                          else instr([Sustained.Winds],'-') end) AS INT) AS MinSpeed,
         cast(case when instr([Sustained.Winds],'-') = 0
                   then 999
                   else substr([Sustained.Winds],
                               instr([Sustained.Winds],'-')+1,
                               instr([Sustained.Winds],' ')-instr([Sustained.Winds],'-')) end AS INT) AS MaxSpeed
  from categories
  where [Sustained.Winds] LIKE '%mph%'")

catwinds                   
##   Category   Sustained.Winds MinSpeed MaxSpeed
## 1        1         74-95 mph       74       95
## 2        2        96-110 mph       96      110
## 3        3       111-129 mph      111      129
## 4        4       130-156 mph      130      156
## 5        5 157 mph or higher      157      999
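For comparison, the same Min and Max split can be sketched in base R without SQL. This is an alternative illustration, not the approach used in this analysis; the labels are copied from the catwinds output above, and the names sustained, nums and catwinds_base are introduced here for the example.

```r
# Wind-range labels exactly as they appear in the catwinds output above
sustained <- c("74-95 mph", "96-110 mph", "111-129 mph",
               "130-156 mph", "157 mph or higher")

# Pull every run of digits out of each label (one character vector per label)
nums <- regmatches(sustained, gregexpr("[0-9]+", sustained))

# The first number in each label is the lower bound
min_speed <- as.integer(vapply(nums, function(x) x[1], character(1)))

# The open-ended top category has only one number, so reuse the 999 sentinel from the SQL
max_speed <- as.integer(vapply(
  nums,
  function(x) if (length(x) >= 2) x[2] else "999",
  character(1)
))

catwinds_base <- data.frame(
  Category        = 1:5,
  Sustained.Winds = sustained,
  MinSpeed        = min_speed,
  MaxSpeed        = max_speed
)
print(catwinds_base)
```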

With this new data we can assign a category to each storm in the hurricanes dataset by comparing its LF windspeed to the Min and Max windspeeds in the SSHWS category criteria.

for (i in 1:nrow(hurricanes)){
  hurricanes$cat[i] <- catwinds$Category[hurricanes$LF.WindsMPH[i] >= catwinds$MinSpeed & hurricanes$LF.WindsMPH[i] <= catwinds$MaxSpeed]
}

# let's see what the above script did to the hurricanes table by looking at a subset of it

hurricanescat <-  sqldf("select [Name],[Year],[LF.WindsMPH], [cat] from hurricanes")

hurricanescat
##         Name Year LF.WindsMPH cat
## 1       Easy 1950         120   3
## 2       King 1950         130   4
## 3       Able 1952          85   1
## 4    Barbara 1953          85   1
## 5   Florence 1953          85   1
## 6      Carol 1954         120   3
## 7       Edna 1954         120   3
## 8      Hazel 1954         145   4
## 9     Connie 1955         120   3
## 10     Diane 1955          85   1
## 11      Ione 1955         120   3
## 12    Flossy 1956         105   2
## 13    Audrey 1957         145   4
## 14    Helene 1958         120   3
## 15     Debra 1959          85   1
## 16    Gracie 1959         120   3
## 17     Donna 1960         145   4
## 18     Ethel 1960          85   1
## 19     Carla 1961         145   4
## 20     Cindy 1963          85   1
## 21      Cleo 1964         105   2
## 22      Dora 1964         105   2
## 23     Hilda 1964         120   3
## 24    Isbell 1964         105   2
## 25     Betsy 1965         120   3
## 26      Alma 1966         105   2
## 27      Inez 1966          85   1
## 28    Beulah 1967         120   3
## 29    Gladys 1968         105   2
## 30   Camille 1969         190   5
## 31     Celia 1970         120   3
## 32      Fern 1971          85   1
## 33     Edith 1971         105   2
## 34    Ginger 1971          85   1
## 35     Agnes 1972          85   1
## 36    Carmen 1974         120   3
## 37    Eloise 1975         120   3
## 38     Belle 1976          85   1
## 39      Babe 1977          85   1
## 40       Bob 1979          85   1
## 41     David 1979         105   2
## 42  Frederic 1979         120   3
## 43     Allen 1980         115   3
## 44    Alicia 1983         115   3
## 45     Diana 1984         110   2
## 46       Bob 1985          75   1
## 47     Danny 1985          90   1
## 48     Elena 1985         115   3
## 49    Gloria 1985         120   3
## 50      Juan 1985          85   1
## 51      Kate 1985         100   2
## 52    Bonnie 1986          85   1
## 53   Charley 1986          75   1
## 54     Floyd 1987          75   1
## 55  Florence 1988          80   1
## 56   Chantal 1989          80   1
## 57      Hugo 1989         140   4
## 58     Jerry 1989          85   1
## 59       Bob 1991         105   2
## 60    Andrew 1992         170   5
## 61     Emily 1993         115   3
## 62      Erin 1995         100   2
## 63      Opal 1995         115   3
## 64    Bertha 1996         105   2
## 65      Fran 1996         115   3
## 66     Danny 1997          80   1
## 67    Bonnie 1998         110   2
## 68      Earl 1998          80   1
## 69   Georges 1998         105   2
## 70      Bret 1999         115   3
## 71     Floyd 1999         105   2
## 72     Irene 1999          80   1
## 73      Lili 2002          90   1
## 74 Claudette 2003          90   1
## 75    Isabel 2003         105   2
## 76      Alex 2004          80   1
## 77   Charley 2004         150   4
## 78    Gaston 2004          75   1
## 79   Frances 2004         105   2
## 80      Ivan 2004         120   3
## 81    Jeanne 2004         120   3
## 82     Cindy 2005          75   1
## 83    Dennis 2005         120   3
## 84   Katrina 2005         125   3
## 85   Ophelia 2005          75   1
## 86      Rita 2005         115   3
## 87     Wilma 2005         120   3
## 88  Humberto 2007          90   1
## 89     Dolly 2008          85   1
## 90    Gustav 2008         105   2
## 91       Ike 2008         110   2
## 92     Irene 2011          75   1
## 93     Isaac 2012          80   1
## 94     Sandy 2012          75   1

We now have an official category strength assigned to each hurricane based on its wind speed in MPH at landfall.
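The row-by-row loop could also be written as a single vectorized lookup. The sketch below uses findInterval() with the category lower bounds and, as a check, the first ten landfall wind speeds from the str() output; the names min_speed, winds_first10 and cat_vec are illustrative only.

```r
# Lower wind-speed bound (MPH) of SSHWS categories 1 through 5, from catwinds$MinSpeed
min_speed <- c(74, 96, 111, 130, 157)

# First ten LF.WindsMPH values from the str() output above
winds_first10 <- c(120, 130, 85, 85, 85, 120, 120, 145, 120, 85)

# findInterval() returns, for each wind speed, how many lower bounds it meets or exceeds,
# which is the category number (0 would mean below hurricane strength)
cat_vec <- findInterval(winds_first10, min_speed)
print(cat_vec)
```

Applied to the whole data frame, this would read hurricanes$cat <- findInterval(hurricanes$LF.WindsMPH, catwinds$MinSpeed), assuming catwinds stays sorted by MinSpeed.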

Data Visualization

  1. Graphics: Please make sure to display at least one scatter plot, box plot and histogram. Don’t be limited to this. Please explore the many other options in R packages such as ggplot2.

A base histogram is helpful for seeing how many hurricanes occurred in each category.

hist(hurricanes$cat, main = "Hurricanes Histogram", xlab = "Hurricane Category")

It is interesting to note that there is a greater frequency of cat 3 storms than cat 2 storms in this dataset.
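Because the category is a whole number from 1 to 5, the default bin edges can fall on the category values themselves. One option, sketched here, is to center each bar on its category with explicit breaks. The counts are tallied from the category listing above, and cats is a stand-in name for hurricanes$cat.

```r
# Category counts tallied from the 94-row listing above (stand-in for hurricanes$cat)
cats <- rep(1:5, times = c(37, 20, 28, 7, 2))

# Breaks at the half-integers put exactly one category in each bin
h <- hist(cats, breaks = seq(0.5, 5.5, by = 1), plot = FALSE)
print(h$counts)

# To draw it: plot(h, main = "Hurricanes Histogram", xlab = "Hurricane Category")
```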

We can make a base boxplot with only one dimension, so let’s look at the normalized damage $ in a boxplot.

boxplot(hurricanes$NDAM2014)

Note that there are only 4 storms with devastating damage that are significantly skewing the data. Also note how low the median line falls within the interquartile range (IQR) box. There must be many low-damage storms below the line to pull the median that low, but it is hard to see in the base boxplot.

To make a base scatterplot we need data with at least two dimensions. Let’s look at the # of deaths versus normalized property damage $.

plot(deaths ~ NDAM2014, data = hurricanes)

We can see how Katrina has devastated the base scatterplot as the significant outlier in the # deaths it caused. Perhaps we can see better results using ggplot2?

Let’s revisit the histogram using ggplot2.

library(ggplot2)
ggplot(data = hurricanes) + geom_histogram(aes(x = cat))

This looks more polished than the base histogram, and the background grid makes it easier to see counts across the graph. But what if we wanted to know the probability of a storm being a certain category? We could use a density graph for that.

ggplot(data = hurricanes) + geom_density(aes(x = cat), fill = "grey50")

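This graph shows that cat 1 storms are the most frequent in this dataset, followed by cat 3 and then cat 2. Because the category is a discrete value, the density curve is a smoothed approximation of these relative frequencies rather than an exact probability. Still, among the landfalling hurricanes recorded here, a strong storm was more often a cat 3 than a cat 2. Good to know, right?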
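Since the category is discrete, the relative frequencies can also be computed exactly rather than read off a smoothed curve. A minimal sketch, again using counts tallied from the category listing above (cats stands in for hurricanes$cat):

```r
# Category counts tallied from the 94-row listing above (stand-in for hurricanes$cat)
cats <- rep(1:5, times = c(37, 20, 28, 7, 2))

# Count storms per category, then convert the counts to shares of all storms
cat_counts <- table(cats)
cat_share  <- round(prop.table(cat_counts), 3)
print(cat_share)
```

These shares describe this sample of landfalling storms only; they are not a forecast for any future storm.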

Next how can we improve on our boxplot? Let’s try adding another dimension to it so we can see if that helps with the outliers:

ggplot(hurricanes, aes(y = NDAM2014, x = cat, group = cat)) + geom_boxplot()

That’s better! Here we can see that as the category of the storm increases the median normalized property damage increases through its respective IQRs at what appears to be a non-linear rate…in fact it looks like it could be exponential. That begins to answer part of our meaningful question.

But we cannot rest on our laurels and idly play our violin as Rome burns at this point, so let’s pick up our violin and see what kind of plot we can make with it.

ggplot(hurricanes, aes(y = NDAM2014, x = cat, group = cat)) + geom_point() + geom_violin()

This point/violin plot shows, once again, how Katrina is an outlying cat 3 storm having caused over $80B in damage. But it also includes visuals of the density of the data in the bubbles near the bottom of each category’s shape, which is pretty cool. For example, you can see the low but broad flattened density at the bottom of cat 1.

Last let’s see what features ggplot2 adds to scatterplot capabilities.

ggplot(hurricanes, aes(x = NDAM2014, y = deaths)) + geom_point(aes(color = cat))  + scale_x_continuous(trans = 'log10') + scale_y_continuous(trans = 'log10')
## Warning: Transformation introduced infinite values in continuous y-axis

Adding color to the data points and a legend helps to clarify the plot. The grid format makes it easier to see where points lie with respect to both axes, and transforming both axes to a log10 scale allows for a better visual distribution of the points in the scatterplot. The warning arises because storms with zero deaths have no finite log10 value.
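The infinite-values warning can be reproduced outside of ggplot2. The sketch below uses the first ten deaths values from the str() output; adding 1 before taking the log is shown only as one common workaround and is an assumption of this example, not something done in the analysis above.

```r
# First ten deaths values from the str() output above (two storms had zero deaths)
deaths_first10 <- c(2, 4, 3, 1, 0, 60, 20, 20, 0, 200)

# log10(0) is -Inf, which is what triggers the ggplot2 warning
log_deaths <- log10(deaths_first10)
n_infinite <- sum(is.infinite(log_deaths))
print(n_infinite)

# Assumed workaround: shift by 1 so that zero deaths maps to log10(1) = 0
log_deaths_shifted <- log10(deaths_first10 + 1)
print(round(log_deaths_shifted, 2))
```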

Conclusion

Please write a brief conclusion paragraph in R markdown at the end.

The original question was: Is there a linear relationship between a storm’s magnitude of destruction and its SSHWS category ranking? Using the base R visualization tools it was difficult to see any relationship between a storm’s category and the destruction that it caused. Looking at the ggplot2 boxplot of normalized damage by category, it appears that for category 1, 2 and 3 hurricanes there may be a roughly linear relationship to the damage that they cause, but as storms grow beyond cat 3 to cat 4 and 5, the magnitude of destruction appears to grow along an exponential curve. Therefore, the visual evidence suggests that there is no simple linear relationship between the category of a storm and the destruction in its wake.
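One way to put numbers behind the boxplot reading is to compare median normalized damage by category. The sketch below uses only the first ten storms, combining the NDAM2014 values from the str() output with the categories from the listing, so it covers categories 1, 3 and 4 only and is an illustration rather than a result; sample10 and med_by_cat are names introduced here.

```r
# First ten storms: categories from the listing, NDAM2014 from the str() output
sample10 <- data.frame(
  cat      = c(3, 4, 1, 1, 1, 3, 3, 4, 3, 1),
  NDAM2014 = c(1870, 6030, 170, 65, 18, 21375, 3520, 28500, 2270, 17250)
)

# Median normalized damage within each category present in the sample
med_by_cat <- tapply(sample10$NDAM2014, sample10$cat, median)
print(med_by_cat)
```

Running the same tapply() call on the full hurricanes data frame would give the medians that the boxplot displays.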

References

Female hurricanes are deadlier Kiju Jung, Sharon Shavitt, Madhu Viswanathan, Joseph M. Hilbe Proceedings of the National Academy of Sciences Jun 2014, 111 (24) 8782-8787; DOI: 10.1073/pnas.1402786111