In this report we aim to determine which weather phenomena have caused the greatest damage to population health and the economy in the United States. To do so, we used the National Oceanic and Atmospheric Administration (NOAA) Storm Database, which has data on weather events in the United States from 1950 to November 2011. To determine which phenomena had the greatest impact on population health, we looked at which event types caused the most injuries and fatalities. To determine which phenomena had the greatest impact on the economy, we looked at which event types caused the most property and crop damage. From the data, we found that in the US, Tornados have done the most damage to population health, and Floods have done the most damage to the economy.
The data for this analysis was obtained from the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. The events in the database start in 1950 and end in November 2011. There are fewer records for the earlier years and many more for recent years. This is acceptable for our purposes, as it weights the data toward recent years, which better reflect current climate conditions. However, because data is missing from the early years, the sums obtained in this analysis should not be taken as total amounts of damage, but as minimum amounts that set benchmarks for comparing event types against each other.
First, we read in the storm data from the compressed csv file (.csv.bz2).
stormData<-read.csv("repdata_data_StormData.csv.bz2")
We check the dimensions of the stormData dataset.
dim(stormData)
## [1] 902297 37
We can see that there are 902297 observations on 37 variables.
For the purpose of this analysis, we will not be using all 37 variables, as many of them serve no purpose here. Therefore, we take a subset of stormData containing only the variables we need: BGN_DATE, EVTYPE, FATALITIES, INJURIES, PROPDMG, PROPDMGEXP, CROPDMG, and CROPDMGEXP.
relevantVars<-c("BGN_DATE", "EVTYPE", "FATALITIES", "INJURIES", "PROPDMG", "PROPDMGEXP", "CROPDMG", "CROPDMGEXP")
subsetStormData<- stormData[relevantVars]
The event type (EVTYPE) field in the storm data is very inconsistent. While NOAA lists just 48 event types, this data set has 985 unique event types.
length(unique(subsetStormData$EVTYPE))
## [1] 985
There are many repeats and synonyms in this field. However, most of these event types do not have any injuries, fatalities, property damage, or crop damage associated with them. As these are the factors we are concerned with, it is impractical and computationally expensive to clean up all the EVTYPE data. Instead, the EVTYPEs will be condensed after the data has been reduced to the event types that had any impact on population health or the economy. Matching event types will be folded into just 14 broader categories, as the NOAA classifications provide more granularity than is needed to see which event types are the most harmful (e.g. Extreme Cold/Wind and Cold/Wind are separate NOAA categories). This will be done by assigning every EVTYPE that contains a certain string fragment to 1 of the 14 categories. Event types that match none of the fragments (for example LIGHTNING or HAIL) keep their original names. After that, we will only be looking at the 10 event types with the most impact in each field.
The reason to do this is that this way we avoid spending time cleaning data that does not have any impact on our analysis, and we focus on the information that we actually need.
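The claim that most event types carry no impact can be checked directly. The following is a minimal sketch on a made-up toy data frame (the `toy` values are hypothetical and not taken from the NOAA file); on the real data, `subsetStormData` would take the place of `toy`.

```r
# Toy stand-in for subsetStormData (hypothetical values, for illustration only)
toy <- data.frame(
  EVTYPE     = c("TORNADO", "TSTM WIND", "SUMMARY OF MAY 10", "HIGH SURF", "TORNADO"),
  FATALITIES = c(2, 0, 0, 0, 1),
  INJURIES   = c(10, 1, 0, 0, 0),
  PROPDMG    = c(25, 5, 0, 0, 0),
  CROPDMG    = c(0, 0, 0, 0, 0),
  stringsAsFactors = FALSE
)

# A record has impact if at least one of the four measures is positive
measures  <- c("FATALITIES", "INJURIES", "PROPDMG", "CROPDMG")
hasImpact <- rowSums(toy[measures] > 0) > 0

# Compare the number of distinct event types overall and among impactful records
allTypes       <- unique(toy$EVTYPE)
impactfulTypes <- unique(toy$EVTYPE[hasImpact])
length(allTypes)
length(impactfulTypes)
```

The gap between the two counts shows how many event types can be ignored before any consolidation is attempted.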
fragment<-c("AVALA","TORNAD", "THUNDERST","TSTM", "WINT", "FLOOD", "COLD", "SNOW","WIND", "ICE", "FOG", "FREEZ", "HURRICANE", "CURRENT", "HEAT")
full<-c("AVALANCHE","TORNADO", "THUNDERSTORM", "THUNDERSTORM", "WINTER WEATHER", "FLOOD", "COLD", "SNOW","STRONG WINDS", "ICE", "FOG", "FREEZING RAIN", "HURRICANE", "CURRENT", "EXCESSIVE HEAT")
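The fragment matching described above can be wrapped in a small function. This is only a sketch: `consolidateEvtype` is a hypothetical helper name that does not appear in the original analysis, and the example event names are chosen for illustration. Fragments are applied in order, so a later fragment sees the result of earlier replacements.

```r
fragment <- c("AVALA", "TORNAD", "THUNDERST", "TSTM", "WINT", "FLOOD", "COLD", "SNOW",
              "WIND", "ICE", "FOG", "FREEZ", "HURRICANE", "CURRENT", "HEAT")
full <- c("AVALANCHE", "TORNADO", "THUNDERSTORM", "THUNDERSTORM", "WINTER WEATHER",
          "FLOOD", "COLD", "SNOW", "STRONG WINDS", "ICE", "FOG", "FREEZING RAIN",
          "HURRICANE", "CURRENT", "EXCESSIVE HEAT")

# Hypothetical helper: replace every name containing fragment[i] with full[i]
consolidateEvtype <- function(evtype, fragment, full) {
  evtype <- as.character(evtype)
  for (i in seq_along(fragment)) {
    hit <- grepl(fragment[i], evtype, ignore.case = TRUE)
    evtype[hit] <- full[i]  # logical indexing replaces all matches at once
  }
  evtype
}

examples <- c("TSTM WIND", "Flash Flood", "RIP CURRENTS", "HIGH WIND", "LIGHTNING")
consolidated <- consolidateEvtype(examples, fragment, full)
consolidated
# "THUNDERSTORM" "FLOOD" "CURRENT" "STRONG WINDS" "LIGHTNING"
```

Note that "TSTM WIND" ends up as THUNDERSTORM rather than STRONG WINDS because the "TSTM" fragment is applied before "WIND", and that LIGHTNING matches no fragment and is left unchanged.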
Unlike Fatalities and Injuries, Property and Crop damage are not stored as plain dollar amounts in the dataset. Each is expressed as a numeric value (PROPDMG, CROPDMG) paired with a magnitude indicator (PROPDMGEXP, CROPDMGEXP): K (thousand), M (million), or B (billion).
unique(subsetStormData$PROPDMGEXP)
## [1] K M B m + 0 5 6 ? 4 2 3 h 7 H - 1 8
## Levels: - ? + 0 1 2 3 4 5 6 7 8 B h H K m M
unique(subsetStormData$CROPDMGEXP)
## [1] M K m B ? 0 k 2
## Levels: ? 0 2 B k K m M
If we look at the unique values in columns PROPDMGEXP and CROPDMGEXP, we can see that apart from K, M, and B, there are other values present as well: digits, symbols, and lowercase variants. We first eliminate records that are not relevant to our analysis by keeping only nonzero values of property and crop damage. Then we keep only records with a valid magnitude indicator: K, M, or B for property damage, and K, M, m, or B for crop damage (lowercase m is read as million).
propertyStormData<-subset(subsetStormData, subsetStormData$PROPDMG>0)
propertyStormData<-subset(propertyStormData,
propertyStormData$PROPDMGEXP == "B" | propertyStormData$PROPDMGEXP == "M" | propertyStormData$PROPDMGEXP == "K")
cropStormData<-subset(subsetStormData, subsetStormData$CROPDMG>0)
cropStormData<-subset(cropStormData,
cropStormData$CROPDMGEXP == "B" | cropStormData$CROPDMGEXP == "M" | cropStormData$CROPDMGEXP == "m" | cropStormData$CROPDMGEXP == "K")
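Before discarding records, it can be useful to see how many damaging records each magnitude code accounts for. The sketch below uses a made-up toy data frame (the values are hypothetical); `validCodes` mirrors the three property-damage codes kept above.

```r
# Toy stand-in for the PROPDMG / PROPDMGEXP columns (hypothetical values)
toy <- data.frame(
  PROPDMG    = c(25, 2.5, 0, 10, 3, 1),
  PROPDMGEXP = c("K", "M", "", "m", "5", "B"),
  stringsAsFactors = FALSE
)

# Keep only records that report some damage, as in the analysis
damaging <- toy[toy$PROPDMG > 0, ]

# Frequency of each magnitude code among damaging records
table(damaging$PROPDMGEXP)

# Number of damaging records kept and dropped by the K/M/B filter
validCodes <- c("K", "M", "B")
kept <- damaging$PROPDMGEXP %in% validCodes
sum(kept)
sum(!kept)
```

If the dropped count is small relative to the kept count, excluding the unrecognized codes should have little effect on the rankings.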
Next, we loop through both PROPDMG and CROPDMG and multiply each value by either a thousand, a million, or a billion depending on its magnitude indicator.
# Scale each property damage value by its magnitude indicator
for(i in seq_along(propertyStormData$PROPDMG)){
    if(propertyStormData$PROPDMGEXP[i] == "B"){
        propertyStormData$PROPDMG[i] <- (propertyStormData$PROPDMG[i]*1000000000)
    }
    if(propertyStormData$PROPDMGEXP[i] == "M"){
        propertyStormData$PROPDMG[i] <- (propertyStormData$PROPDMG[i]*1000000)
    }
    if(propertyStormData$PROPDMGEXP[i] == "K"){
        propertyStormData$PROPDMG[i] <- (propertyStormData$PROPDMG[i]*1000)
    }
}
# Scale each crop damage value the same way; "m" is treated as million
for(i in seq_along(cropStormData$CROPDMG)){
    if(cropStormData$CROPDMGEXP[i] == "B"){
        cropStormData$CROPDMG[i] <- (cropStormData$CROPDMG[i]*1000000000)
    }
    if(cropStormData$CROPDMGEXP[i] == "M" | cropStormData$CROPDMGEXP[i] == "m"){
        cropStormData$CROPDMG[i] <- (cropStormData$CROPDMG[i]*1000000)
    }
    if(cropStormData$CROPDMGEXP[i] == "K"){
        cropStormData$CROPDMG[i] <- (cropStormData$CROPDMG[i]*1000)
    }
}
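The same scaling can also be written without an explicit loop by using a named lookup vector. This is a sketch of an alternative, not the method used above; the `toy` data frame and the `multiplier` name are made up for illustration.

```r
# Toy stand-in for propertyStormData after filtering (hypothetical values)
toy <- data.frame(
  PROPDMG    = c(25, 2.5, 1),
  PROPDMGEXP = c("K", "M", "B"),
  stringsAsFactors = FALSE
)

# Named lookup vector: magnitude code -> dollar multiplier
multiplier <- c(K = 1e3, M = 1e6, B = 1e9)

# Index the lookup by code; as.character() guards against factor columns
scale <- unname(multiplier[as.character(toy$PROPDMGEXP)])
toy$PROPDMG <- toy$PROPDMG * scale
toy$PROPDMG
```

A code that is missing from the lookup vector would produce NA here, so this approach assumes the records were already filtered to valid codes.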
To calculate the number of fatalities caused by each event type, we split the FATALITIES column by EVTYPE and sum the values within each group. As we are not concerned with event types that did not cause any fatalities, we remove them from the data.
fatalitiesByEVTYPE<-split(subsetStormData$FATALITIES,subsetStormData$EVTYPE)
fatalitiesByEVTYPE<-sapply(fatalitiesByEVTYPE, sum, na.rm = TRUE)
fatalitiesByEVTYPE<-as.data.frame(fatalitiesByEVTYPE)
fatalitiesByEVTYPE$EVTYPE<-row.names(fatalitiesByEVTYPE)
names(fatalitiesByEVTYPE)<-c("FATALITIES", "EVTYPE")
fatalitiesByEVTYPE<-fatalitiesByEVTYPE[order(-fatalitiesByEVTYPE$FATALITIES),]
fatalitiesByEVTYPE <- subset(fatalitiesByEVTYPE, fatalitiesByEVTYPE$FATALITIES>0)
Now that we have a list of EVTYPEs and the sum of FATALITIES for each type, we must combine similar/redundant EVTYPEs. We do this here instead of in the preprocessing because it is much less computationally expensive: the consolidation runs only over the event types relevant to our analysis rather than over every record.
for(i in seq_along(fragment)){
    # flag every EVTYPE that contains the i-th fragment, ignoring case
    temp<-grepl(fragment[i], fatalitiesByEVTYPE$EVTYPE, ignore.case = TRUE)
    for(j in seq_along(fatalitiesByEVTYPE$EVTYPE)){
        if(temp[j]){
            fatalitiesByEVTYPE$EVTYPE[j]<-full[i]
        }
    }
}
Now that we have reduced the variety of EVTYPEs, we must once again sum the data by FATALITIES per each EVTYPE.
fatalitiesByEVTYPE<-split(fatalitiesByEVTYPE$FATALITIES,fatalitiesByEVTYPE$EVTYPE)
fatalitiesByEVTYPE<-sapply(fatalitiesByEVTYPE, sum, na.rm = TRUE)
fatalitiesByEVTYPE<-as.data.frame(fatalitiesByEVTYPE)
fatalitiesByEVTYPE$EVTYPE<-row.names(fatalitiesByEVTYPE)
names(fatalitiesByEVTYPE)<-c("FATALITIES", "EVTYPE")
fatalitiesByEVTYPE<-fatalitiesByEVTYPE[order(-fatalitiesByEVTYPE$FATALITIES),]
row.names(fatalitiesByEVTYPE)<-NULL
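The split, sum, and reorder steps used here can also be expressed with base R's `aggregate()`. The sketch below is an alternative formulation on a made-up toy data frame (hypothetical values), not the code that produced the results in this report.

```r
# Toy stand-in for the EVTYPE and FATALITIES columns (hypothetical values)
toy <- data.frame(
  EVTYPE     = c("TORNADO", "FLOOD", "TORNADO", "HAIL", "FLOOD"),
  FATALITIES = c(3, 1, 2, 0, 3),
  stringsAsFactors = FALSE
)

# Sum FATALITIES within each EVTYPE
totals <- aggregate(FATALITIES ~ EVTYPE, data = toy, FUN = sum)

# Drop event types with no fatalities, then sort from most to fewest
totals <- totals[totals$FATALITIES > 0, ]
totals <- totals[order(-totals$FATALITIES), ]
row.names(totals) <- NULL
totals
```

Unlike the split/sapply version, `aggregate()` returns the EVTYPE column first, so the column order differs from the tables shown in this report.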
We finally have an ordered list of the weather phenomena that have caused the greatest number of fatalities within the US between 1950 and November 2011. As we are primarily concerned with what is causing the most damage, we will view the top 10 most damaging event types.
top10Fatalities<-head(fatalitiesByEVTYPE, 10)
top10Fatalities
## FATALITIES EVTYPE
## 1 5661 TORNADO
## 2 3138 EXCESSIVE HEAT
## 3 1525 FLOOD
## 4 816 LIGHTNING
## 5 729 THUNDERSTORM
## 6 577 CURRENT
## 7 471 STRONG WINDS
## 8 451 COLD
## 9 279 WINTER WEATHER
## 10 225 AVALANCHE
As we can see, the three weather event types that have caused the most fatalities are Tornados (5661 fatalities), Excessive Heat (3138 fatalities), and Floods (1525 fatalities). Following these are Lightning, Thunderstorms, and Currents, which each killed between 500 and 1000 people, and Strong Winds, Cold, Winter Weather, and Avalanches, which each killed fewer than 500 people.
To calculate the number of injuries caused by each event type, we use the same process we used for fatalities, except with data from the INJURIES column. We split the INJURIES column by EVTYPE and sum the values within each group, remove event types that did not cause any injuries, combine similar/redundant EVTYPEs, and sum the INJURIES by EVTYPE again.
injuriesByEVTYPE<-split(subsetStormData$INJURIES,subsetStormData$EVTYPE)
injuriesByEVTYPE<-sapply(injuriesByEVTYPE, sum, na.rm = TRUE)
injuriesByEVTYPE<-as.data.frame(injuriesByEVTYPE)
injuriesByEVTYPE$EVTYPE<-row.names(injuriesByEVTYPE)
names(injuriesByEVTYPE)<-c("INJURIES", "EVTYPE")
injuriesByEVTYPE<-injuriesByEVTYPE[order(-injuriesByEVTYPE$INJURIES),]
injuriesByEVTYPE <- subset(injuriesByEVTYPE, injuriesByEVTYPE$INJURIES>0)
for(i in seq_along(fragment)){
    # flag every EVTYPE that contains the i-th fragment, ignoring case
    temp<-grepl(fragment[i], injuriesByEVTYPE$EVTYPE, ignore.case = TRUE)
    for(j in seq_along(injuriesByEVTYPE$EVTYPE)){
        if(temp[j]){
            injuriesByEVTYPE$EVTYPE[j]<-full[i]
        }
    }
}
injuriesByEVTYPE<-split(injuriesByEVTYPE$INJURIES,injuriesByEVTYPE$EVTYPE)
injuriesByEVTYPE<-sapply(injuriesByEVTYPE, sum, na.rm = TRUE)
injuriesByEVTYPE<-as.data.frame(injuriesByEVTYPE)
injuriesByEVTYPE$EVTYPE<-row.names(injuriesByEVTYPE)
names(injuriesByEVTYPE)<-c("INJURIES", "EVTYPE")
injuriesByEVTYPE<-injuriesByEVTYPE[order(-injuriesByEVTYPE$INJURIES),]
row.names(injuriesByEVTYPE)<-NULL
We are once again mainly concerned with which events cause the most damage to population health, so we look at the top 10 injury-causing events.
top10Injuries<-head(injuriesByEVTYPE, 10)
top10Injuries
## INJURIES EVTYPE
## 1 91407 TORNADO
## 2 9544 THUNDERSTORM
## 3 9224 EXCESSIVE HEAT
## 4 8604 FLOOD
## 5 5230 LIGHTNING
## 6 2152 ICE
## 7 1968 WINTER WEATHER
## 8 1896 STRONG WINDS
## 9 1361 HAIL
## 10 1328 HURRICANE
Tornados take the number 1 spot for the most injuries by a large margin, having caused 91407 injuries. The next most injurious events are Thunderstorms (9544), Excessive Heat (9224), Floods (8604), Lightning (5230), and Ice (2152). The remaining events, Winter Weather, Strong Winds, Hail, and Hurricanes, each caused fewer than 2000 injuries.
From our previous results, we can see that:
Most Fatalities = Tornados (5661)
Most Injuries = Tornados (91407)
It is clear from this that Tornados are the most harmful event type with respect to population health.
However, if we wish to see which event types cause the next most health damage, we should combine the fatality and injury counts of the top 10 fatality-causing and injury-causing events. A fatality is of course more severe than an injury. For the sake of this analysis, however, we give fatalities and injuries equal weight.
names(top10Fatalities)<-c("DMG", "EVTYPE")
names(top10Injuries)<-c("DMG", "EVTYPE")
allHealth<-rbind(top10Fatalities, top10Injuries)
Just as we have done previously, we can now consolidate EVTYPEs and generate the list of top 10 health damaging event types.
for(i in seq_along(fragment)){
    # flag every EVTYPE that contains the i-th fragment, ignoring case
    temp<-grepl(fragment[i], allHealth$EVTYPE, ignore.case = TRUE)
    for(j in seq_along(allHealth$EVTYPE)){
        if(temp[j]){
            allHealth$EVTYPE[j]<-full[i]
        }
    }
}
allHealth<-split(allHealth$DMG,allHealth$EVTYPE)
allHealth<-sapply(allHealth, sum, na.rm = TRUE)
allHealth<-as.data.frame(allHealth)
allHealth$EVTYPE<-row.names(allHealth)
names(allHealth)<-c("DMG", "EVTYPE")
allHealth<-allHealth[order(-allHealth$DMG),]
row.names(allHealth)<-NULL
top10allHealth<-head(allHealth, 10)
top10allHealth
## DMG EVTYPE
## 1 97068 TORNADO
## 2 12362 EXCESSIVE HEAT
## 3 10273 THUNDERSTORM
## 4 10129 FLOOD
## 5 6046 LIGHTNING
## 6 2367 STRONG WINDS
## 7 2247 WINTER WEATHER
## 8 2152 ICE
## 9 1361 HAIL
## 10 1328 HURRICANE
barplot(top10allHealth$DMG, names.arg = top10allHealth$EVTYPE, las=2, ylab = "People Hurt/Killed", main = "Overall Weather Damage to Population Health", cex.names=0.5, cex.axis=0.5)
As can be seen in the plot, Tornados are by far the most damaging event type to population health. The other events in the top 10 are still harmful to people, but do not pose a threat nearly as great as tornados do in the US.
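Because the combined table is built from the two top-10 lists, an event type that appears in only one list contributes only that one measure (ICE, for instance, enters with its 2152 injuries alone). As a hedged sketch of an alternative, the per-type totals could be joined before ranking. The toy values and the weights `wFatal` and `wInjury` below are hypothetical; the weights default to 1 to match the equal weighting used in this report.

```r
# Toy per-type totals (hypothetical values, for illustration only)
fatal <- data.frame(EVTYPE = c("TORNADO", "COLD", "ICE"),
                    FATALITIES = c(50, 20, 5), stringsAsFactors = FALSE)
injur <- data.frame(EVTYPE = c("TORNADO", "ICE", "HAIL"),
                    INJURIES = c(900, 40, 30), stringsAsFactors = FALSE)

# Full outer join so event types present in only one table are kept
health <- merge(fatal, injur, by = "EVTYPE", all = TRUE)

# A type missing from one table contributed zero to that measure
health$FATALITIES[is.na(health$FATALITIES)] <- 0
health$INJURIES[is.na(health$INJURIES)] <- 0

# Equal weights by assumption; change them to weight fatalities more heavily
wFatal  <- 1
wInjury <- 1
health$DMG <- wFatal * health$FATALITIES + wInjury * health$INJURIES

health <- health[order(-health$DMG), ]
row.names(health) <- NULL
health[c("DMG", "EVTYPE")]
```

With this formulation every event type is ranked on both measures, at the cost of carrying the full tables through to the final step.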
During the preprocessing phase, we eliminated the need for the PROPDMGEXP variable and converted the values in the PROPDMG field to their full numeric value. We use the same method we have used to calculate total injuries and fatalities to calculate the total amount of property damage per event type.
pdmgByEVTYPE<-split(propertyStormData$PROPDMG,propertyStormData$EVTYPE)
pdmgByEVTYPE<-sapply(pdmgByEVTYPE, sum, na.rm = TRUE)
pdmgByEVTYPE<-as.data.frame(pdmgByEVTYPE)
pdmgByEVTYPE$EVTYPE<-row.names(pdmgByEVTYPE)
names(pdmgByEVTYPE)<-c("PROPDMG", "EVTYPE")
pdmgByEVTYPE<-pdmgByEVTYPE[order(-pdmgByEVTYPE$PROPDMG),]
for(i in seq_along(fragment)){
    # flag every EVTYPE that contains the i-th fragment, ignoring case
    temp<-grepl(fragment[i], pdmgByEVTYPE$EVTYPE, ignore.case = TRUE)
    for(j in seq_along(pdmgByEVTYPE$EVTYPE)){
        if(temp[j]){
            pdmgByEVTYPE$EVTYPE[j]<-full[i]
        }
    }
}
pdmgByEVTYPE<-split(pdmgByEVTYPE$PROPDMG,pdmgByEVTYPE$EVTYPE)
pdmgByEVTYPE<-sapply(pdmgByEVTYPE, sum, na.rm = TRUE)
pdmgByEVTYPE<-as.data.frame(pdmgByEVTYPE)
pdmgByEVTYPE$EVTYPE<-row.names(pdmgByEVTYPE)
names(pdmgByEVTYPE)<-c("PROPDMG", "EVTYPE")
pdmgByEVTYPE<-pdmgByEVTYPE[order(-pdmgByEVTYPE$PROPDMG),]
row.names(pdmgByEVTYPE)<-NULL
We are once again mainly concerned with which events cause the most damage to property, so we look at the top 10 property-damaging events.
top10pdmg<-head(pdmgByEVTYPE, 10)
top10pdmg
## PROPDMG EVTYPE
## 1 167529215320 FLOOD
## 2 84636180010 HURRICANE
## 3 58581597730 TORNADO
## 4 43323536000 STORM SURGE
## 5 15727366720 HAIL
## 6 10973619030 THUNDERSTORM
## 7 7703890550 TROPICAL STORM
## 8 6777307750 WINTER WEATHER
## 9 6185735990 STRONG WINDS
## 10 4765114000 WILDFIRE
We can see from the results that the event type most damaging to property is Floods at an astounding $167,529,215,320, nearly double the next highest event type, Hurricanes ($84,636,180,010). These are followed by Tornados ($58,581,597,730), Storm Surges ($43,323,536,000), Hail ($15,727,366,720), and Thunderstorms ($10,973,619,030), each of which did significant damage, though less than Floods or Hurricanes. The remaining events, Tropical Storms, Winter Weather, Strong Winds, and Wildfire, each did less than $10,000,000,000 in property damage. While these amounts seem very high, keep in mind that this is the amount of damage done over the course of many decades.
During the preprocessing phase, we eliminated the need for the CROPDMGEXP variable and converted the values in the CROPDMG field to their full numeric value. We use the same method we used for total injuries and fatalities to calculate the total amount of crop damage per event type.
cdmgByEVTYPE<-split(cropStormData$CROPDMG,cropStormData$EVTYPE)
cdmgByEVTYPE<-sapply(cdmgByEVTYPE, sum, na.rm = TRUE)
cdmgByEVTYPE<-as.data.frame(cdmgByEVTYPE)
cdmgByEVTYPE$EVTYPE<-row.names(cdmgByEVTYPE)
names(cdmgByEVTYPE)<-c("CROPDMG", "EVTYPE")
cdmgByEVTYPE<-cdmgByEVTYPE[order(-cdmgByEVTYPE$CROPDMG),]
for(i in seq_along(fragment)){
    # flag every EVTYPE that contains the i-th fragment, ignoring case
    temp<-grepl(fragment[i], cdmgByEVTYPE$EVTYPE, ignore.case = TRUE)
    for(j in seq_along(cdmgByEVTYPE$EVTYPE)){
        if(temp[j]){
            cdmgByEVTYPE$EVTYPE[j]<-full[i]
        }
    }
}
cdmgByEVTYPE<-split(cdmgByEVTYPE$CROPDMG,cdmgByEVTYPE$EVTYPE)
cdmgByEVTYPE<-sapply(cdmgByEVTYPE, sum, na.rm = TRUE)
cdmgByEVTYPE<-as.data.frame(cdmgByEVTYPE)
cdmgByEVTYPE$EVTYPE<-row.names(cdmgByEVTYPE)
names(cdmgByEVTYPE)<-c("CROPDMG", "EVTYPE")
cdmgByEVTYPE<-cdmgByEVTYPE[order(-cdmgByEVTYPE$CROPDMG),]
row.names(cdmgByEVTYPE)<-NULL
We are once again mainly concerned with which events cause the most damage to crops, so we look at the top 10 crop-damaging events.
top10cdmg<-head(cdmgByEVTYPE, 10)
top10cdmg
## CROPDMG EVTYPE
## 1 13972566000 DROUGHT
## 2 12380069100 FLOOD
## 3 5505292800 HURRICANE
## 4 5022114300 ICE
## 5 3025537450 HAIL
## 6 1889061000 FREEZING RAIN
## 7 1416765500 COLD
## 8 1271704900 THUNDERSTORM
## 9 904469280 EXCESSIVE HEAT
## 10 777875550 STRONG WINDS
We can see that Drought ($13,972,566,000) and Floods ($12,380,069,100) cause the most damage to crops. These are followed by Hurricanes ($5,505,292,800), Ice ($5,022,114,300), Hail ($3,025,537,450), Freezing Rain ($1,889,061,000), Cold ($1,416,765,500), and Thunderstorms ($1,271,704,900), each of which did significant damage, though less than Drought or Floods. The remaining events, Excessive Heat and Strong Winds, each did less than $1,000,000,000 in crop damage. Once again, we need to remember that while these amounts seem very high, this is the amount of damage done over the course of many decades.
From our previous results, we can see that:
Highest Property Damage = Floods ($167,529,215,320)
Highest Crop Damage = Drought ($13,972,566,000)
However, if we wish to see which event type causes the most damage and has the greatest impact on the economy, we should combine the dollar amounts of the top 10 Property and Crop damaging events.
names(top10pdmg)<-c("DMG", "EVTYPE")
names(top10cdmg)<-c("DMG", "EVTYPE")
allDmg<-rbind(top10pdmg, top10cdmg)
Just as we have done previously, we can now consolidate EVTYPEs and generate the list of top 10 financially damaging event types.
for(i in seq_along(fragment)){
    # flag every EVTYPE that contains the i-th fragment, ignoring case
    temp<-grepl(fragment[i], allDmg$EVTYPE, ignore.case = TRUE)
    for(j in seq_along(allDmg$EVTYPE)){
        if(temp[j]){
            allDmg$EVTYPE[j]<-full[i]
        }
    }
}
allDmg<-split(allDmg$DMG,allDmg$EVTYPE)
allDmg<-sapply(allDmg, sum, na.rm = TRUE)
allDmg<-as.data.frame(allDmg)
allDmg$EVTYPE<-row.names(allDmg)
names(allDmg)<-c("DMG", "EVTYPE")
allDmg<-allDmg[order(-allDmg$DMG),]
row.names(allDmg)<-NULL
top10alldmg<-head(allDmg, 10)
top10alldmg
## DMG EVTYPE
## 1 179909284420 FLOOD
## 2 90141472810 HURRICANE
## 3 58581597730 TORNADO
## 4 43323536000 STORM SURGE
## 5 18752904170 HAIL
## 6 13972566000 DROUGHT
## 7 12245323930 THUNDERSTORM
## 8 7703890550 TROPICAL STORM
## 9 6963611540 STRONG WINDS
## 10 6777307750 WINTER WEATHER
top10alldmg$DMG<-(top10alldmg$DMG/1000000000)
barplot(top10alldmg$DMG, names.arg = top10alldmg$EVTYPE, las=2, ylab = "Damage (Billions of $)", main = "Overall Weather Damage to Economy", cex.names=0.5, cex.axis=0.5)
As can be seen in the plot, Floods are by far the most damaging event type to the economy. Hurricanes and Tornados also cause a lot of damage, although not as much as Floods. The other events listed have caused a lot of damage over the years, but not nearly as much as floods, hurricanes, or tornados.