The Purpose of this document is to examine the most damaging storm types recorded in the US based on NOAA data downloaded from: https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2 More information on this data-set can be found at: https://d396qusza40orc.cloudfront.net/repdata%2Fpeer2_doc%2Fpd01016005curr.pdf We will look at the most damaging storms in terms of health, (fatalities and injuries) and economic consequences. (crop and property damage in USD) This document will describe all data preparation and transformations, along with the results of the analysis. The dplyr, ggplot2, scales, and reshape2 libraries are used for the analysis.
First, we load the necessary libraries and data-set .csv file downloaded from the above link
library(dplyr)
##
## Attaching package: 'dplyr'
##
## The following object is masked from 'package:stats':
##
## filter
##
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
library(scales)
library(reshape2)
SData <- read.csv("repdata-data-StormData.csv")
Next, we subset the data to make it more manageable, extracting only the columns we need for the analysis. These include the event type, number of fatalities and injuries, property and crop damage, and the columns that denote the magnitude of damage in thousands, millions, or billions. (PROPDMGEXP and CROPDMGEXP)
IData <- SData[c("EVTYPE","FATALITIES","INJURIES","PROPDMG","PROPDMGEXP","CROPDMG","CROPDMGEXP")]
We need to process the data the extract the most useable and accurate results for both economic and health consequences. Since the economic data has more processing steps, we will start there.
Upon examining the data, there are a few anomalies we need to address.
str(IData)
## 'data.frame': 902297 obs. of 7 variables:
## $ EVTYPE : Factor w/ 985 levels " HIGH SURF ADVISORY",..: 834 834 834 834 834 834 834 834 834 834 ...
## $ FATALITIES: num 0 0 0 0 0 0 0 0 1 0 ...
## $ INJURIES : num 15 0 2 2 2 6 1 0 14 0 ...
## $ PROPDMG : num 25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
## $ PROPDMGEXP: Factor w/ 19 levels "","-","?","+",..: 17 17 17 17 17 17 17 17 17 17 ...
## $ CROPDMG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CROPDMGEXP: Factor w/ 9 levels "","?","0","2",..: 1 1 1 1 1 1 1 1 1 1 ...
First, the variables that describe the damage magnitude are not case consistent and contain values which we can’t recognize based on the data-set description provided by the NOAA website.
levels(IData$PROPDMGEXP)
## [1] "" "-" "?" "+" "0" "1" "2" "3" "4" "5" "6" "7" "8" "B" "h" "H" "K"
## [18] "m" "M"
levels(IData$CROPDMGEXP)
## [1] "" "?" "0" "2" "B" "k" "K" "m" "M"
Luckily, few of the 902297 observations in the data-set have unrecognized characters in the PROPDMGEXP and CROPDMGEXP fields, so they can be ignored without upsetting the aggregate analysis much. We can transform all the cases to upper for the values that are recognized (k, m, and b in the case of the magnitude variables) for consistency. To be on the safe side, we can convert cases for the Event Types to upper as well. Since R interprets the same character in different cases as being unique, this should alleviate any issues upper/lower case inconsistency may cause.
IData$PROPDMGEXP <- factor(toupper(IData$PROPDMGEXP))
IData$CROPDMGEXP <- factor(toupper(IData$CROPDMGEXP))
IData$EVTYPE <- toupper(IData$EVTYPE)
Next, we want to change the variable names of the base damage value to identify them as such and create numeric variables that multiply each base by its magnitude in dollars so we can compare “apples to apples” as far as damage value is concerned. Any unknown magnitude entries will be converted at 0 value.
IData <- rename(IData, PROPDMGBASE = PROPDMG, CROPDMGBASE = CROPDMG)
IData$PROPDMG <- ifelse(IData$PROPDMGEXP == "K", IData$PROPDMGBASE * 1000,
ifelse(IData$PROPDMGEXP == "M",IData$PROPDMGBASE * 1000000,
ifelse(IData$PROPDMGEXP == "B",IData$PROPDMGBASE * 1000000000, 0)))
IData$CROPDMG <- ifelse(IData$CROPDMGEXP == "K", IData$CROPDMGBASE * 1000,
ifelse(IData$CROPDMGEXP == "M",IData$CROPDMGBASE * 1000000,
ifelse(IData$CROPDMGEXP == "B",IData$CROPDMGBASE * 1000000000, 0)))
We can then create a total damage variable that we will use to rank the different storm types by the amount of total economic damage caused. While we are at it, we can also create a total injuries and fatalities variable for use in the health analysis.
IData$TOTDMG <- IData$PROPDMG + IData$CROPDMG
IData$TOTINJFAT <- IData$FATALITIES + IData$INJURIES
At this point, we have created all the variables we will be working with and cleaned the data. It is a good idea to take a look at what we have:
str(IData)
## 'data.frame': 902297 obs. of 11 variables:
## $ EVTYPE : chr "TORNADO" "TORNADO" "TORNADO" "TORNADO" ...
## $ FATALITIES : num 0 0 0 0 0 0 0 0 1 0 ...
## $ INJURIES : num 15 0 2 2 2 6 1 0 14 0 ...
## $ PROPDMGBASE: num 25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
## $ PROPDMGEXP : Factor w/ 17 levels "","-","?","+",..: 16 16 16 16 16 16 16 16 16 16 ...
## $ CROPDMGBASE: num 0 0 0 0 0 0 0 0 0 0 ...
## $ CROPDMGEXP : Factor w/ 7 levels "","?","0","2",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ PROPDMG : num 25000 2500 25000 2500 2500 2500 2500 2500 25000 25000 ...
## $ CROPDMG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ TOTDMG : num 25000 2500 25000 2500 2500 2500 2500 2500 25000 25000 ...
## $ TOTINJFAT : num 15 0 2 2 2 6 1 0 15 0 ...
Our next step will be to group damage by storm type (factor level), summarize by the sum of Crop, Property, and Total Damage, and arrange in descending order by damage caused using the dplyr package so we can identify the top 10 most damaging storm types.
TDmg <- group_by(IData, EVTYPE)
DmgSummary <- summarize(TDmg, PROPDMG = sum(PROPDMG), CROPDMG = sum(CROPDMG), TOTALDMG = sum(TOTDMG))
DmgSummary <- arrange(DmgSummary, desc(TOTALDMG))
We can now preview the top 10:
head(DmgSummary, 10)
## Source: local data frame [10 x 4]
##
## EVTYPE PROPDMG CROPDMG TOTALDMG
## 1 FLOOD 144657709800 5661968450 150319678250
## 2 HURRICANE/TYPHOON 69305840000 2607872800 71913712800
## 3 TORNADO 56937160480 414953110 57352113590
## 4 STORM SURGE 43323536000 5000 43323541000
## 5 HAIL 15732266720 3025954450 18758221170
## 6 FLASH FLOOD 16140811510 1421317100 17562128610
## 7 DROUGHT 1046106000 13972566000 15018672000
## 8 HURRICANE 11868319010 2741910000 14610229010
## 9 RIVER FLOOD 5118945500 5029459000 10148404500
## 10 ICE STORM 3944927810 5022113500 8967041310
At this point, we subset the data into the top 10 most economically damaging categories for ease of graphing. Note that we are sub setting both the summary data and the data-set with all of the individual points so we can see both the aggregate storm damage and the most individually damaging storms within each category.
Top10Type <- head(DmgSummary$EVTYPE, 10) #get the names of the top 10 from the EVTYPE variable
Top10Dmg <- filter(DmgSummary, EVTYPE %in% Top10Type) #use these names to filter the data
Top10Dat <- filter(IData, EVTYPE %in% Top10Type)
We follow a similar process to extract the injury and fatality information we want from the raw data. This data is easier to work with, and we have the added benefit of using a few objects we created for the economic damage data. Since we already grouped the initial data set we created by the event type, we can use this object (TDmg) to summarize the number of injuries and fatalities.
HlthSummary <- summarize(TDmg, FATALITIES = sum(FATALITIES), INJURIES = sum(INJURIES), TOTINJFAT = sum(TOTINJFAT))
Now, we arrange this summary in descending order by the total and filter the top 10 most dangerous storm types. Again, we filter both the summary and the individual data points to see if there are any extreme individual cases.
HlthSummary <- arrange(HlthSummary, desc(TOTINJFAT))
Top10HType <- head(HlthSummary$EVTYPE, 10) #get the names of the top 10 from the EVTYPE variable
Top10HDmg <- filter(HlthSummary, EVTYPE %in% Top10HType) #use these names to filter the data
Top10HDat <- filter(IData, EVTYPE %in% Top10HType)
We can now preview the top 10 most dangerous storms to health:
head(HlthSummary, 10)
## Source: local data frame [10 x 4]
##
## EVTYPE FATALITIES INJURIES TOTINJFAT
## 1 TORNADO 5633 91346 96979
## 2 EXCESSIVE HEAT 1903 6525 8428
## 3 TSTM WIND 504 6957 7461
## 4 FLOOD 470 6789 7259
## 5 LIGHTNING 816 5230 6046
## 6 HEAT 937 2100 3037
## 7 FLASH FLOOD 978 1777 2755
## 8 ICE STORM 89 1975 2064
## 9 THUNDERSTORM WIND 133 1488 1621
## 10 WINTER STORM 206 1321 1527
Finally, we can use the reshape2 package to melt the data into a long format for graphing. This will convert our damage type and injury variables into factor levels so we can easily view them on a stacked bar chart.
Top10HDmg.m <- melt(Top10HDmg[,1:3], id.vars='EVTYPE', variable.name = "HLTHDMGTYPE", value.name = "DMGNUMBER")
Top10HDat.m <- melt(Top10HDat[,c(1:3)], id.vars='EVTYPE', variable.name = "HLTHDMGTYPE", value.name = "DMGNUMBER")
Top10Dmg.m <- melt(Top10Dmg[,1:3], id.vars='EVTYPE', variable.name = "DMGTYPE", value.name = "DMGVALUE")
Top10Dat.m <- melt(Top10Dat[,c(1,8,9)], id.vars='EVTYPE', variable.name = "DMGTYPE", value.name = "DMGVALUE")
Now that we have adequately processed the data, we can view the graphical and tabular results. First we print out the top 10 list of storms that pose health risks and top 10 list that causes property damage. We can make sure these subsets agree with our earlier previews.
Top10HDmg
## Source: local data frame [10 x 4]
##
## EVTYPE FATALITIES INJURIES TOTINJFAT
## 1 TORNADO 5633 91346 96979
## 2 EXCESSIVE HEAT 1903 6525 8428
## 3 TSTM WIND 504 6957 7461
## 4 FLOOD 470 6789 7259
## 5 LIGHTNING 816 5230 6046
## 6 HEAT 937 2100 3037
## 7 FLASH FLOOD 978 1777 2755
## 8 ICE STORM 89 1975 2064
## 9 THUNDERSTORM WIND 133 1488 1621
## 10 WINTER STORM 206 1321 1527
Top10Dmg
## Source: local data frame [10 x 4]
##
## EVTYPE PROPDMG CROPDMG TOTALDMG
## 1 FLOOD 144657709800 5661968450 150319678250
## 2 HURRICANE/TYPHOON 69305840000 2607872800 71913712800
## 3 TORNADO 56937160480 414953110 57352113590
## 4 STORM SURGE 43323536000 5000 43323541000
## 5 HAIL 15732266720 3025954450 18758221170
## 6 FLASH FLOOD 16140811510 1421317100 17562128610
## 7 DROUGHT 1046106000 13972566000 15018672000
## 8 HURRICANE 11868319010 2741910000 14610229010
## 9 RIVER FLOOD 5118945500 5029459000 10148404500
## 10 ICE STORM 3944927810 5022113500 8967041310
We can now graph our results using ggplot. This allows us to create a stacked bar chart that will break out the different levels of damage type and health risk, and superimpose a box-plot that will allow us to see outliers of extreme individual events on the same graph.
ggplot(Top10HDmg.m, aes(x = reorder(EVTYPE, DMGNUMBER), y = DMGNUMBER,fill=HLTHDMGTYPE)) +
geom_bar(stat="identity") +
geom_boxplot(data = Top10HDat.m, aes(x=EVTYPE, y=DMGNUMBER), show_guide = F) +
scale_y_continuous(labels = comma) +
coord_flip() + labs(y='Number of Individuals',x='Storm Type') +
guides(fill=guide_legend(title="Severe Points Per Set:\nUpper Points = Injuries\nLower Points = Fatalities",
title.position = "bottom")) +
scale_fill_manual(labels=c("Fatalities", "Injuries"),values = c("coral", "burlywood2")) +
ggtitle("Top 10 Most Dangerous Types of Storms to Health \nPoints Represent Individual Storms of Severe Magnitude") +
theme(plot.title = element_text(lineheight=.8, face="bold"))
ggplot(Top10Dmg.m, aes(x = reorder(EVTYPE, DMGVALUE), y = DMGVALUE,fill=DMGTYPE)) +
geom_bar(stat="identity") +
geom_boxplot(data = Top10Dat.m, aes(x=EVTYPE, y=DMGVALUE), show_guide = F) +
scale_y_continuous(labels = comma) +
coord_flip() + labs(y='Damage in USD',x='Storm Type') +
guides(fill=guide_legend(title="Severe Points Per Set:\nUpper Points = Crop\nLower Points = Property",
title.position = "bottom")) +
scale_fill_manual(labels=c("Property Damage", "Crop Damage"),values = c("coral", "burlywood2")) +
ggtitle("Top 10 Most Economically Damaging Types of Storms \nPoints Represent Individual Storms of Severe Magnitude") +
theme(plot.title = element_text(lineheight=.8, face="bold"))
Based on these results, we can easily see the top 10 most dangerous storm types to health and the top 10 most economically damaging storm types. These can be compared in descending order by magnitude based to evalute which types pose the most risk and require the greatest precaution. We can see that while Tornados have historically caused the most injuries and fatalities by a large margin, floods have far-and-away outstripped all other storm types in terms of overall property damage. It’s also interesting to note that while single-event severe magnitude occurances seem to be bunched fairly close together and lower on the scale of heath risk, there are signifant outliers in events that cause property damage that appear to make up a signifigant portion of the total damage. This would indicate that a few events in the Flood, Typhoon, Storm Surge, River Flood, and Ice Storm categories caused a large portion of the overall damage, and should be studied further.