Synopsis

Severe weather events such as tornados and hurricanes cause many fatalities and injuries and a great deal of property damage every year, while floods often result in substantial losses for the crop industry. Identifying which weather events cause the worst damage may help in allocating public money as effectively as possible.
The goal of this analysis is to shed some light on which weather events have the worst impact on human health (fatalities and injuries) and cause the most physical damage (property damage and crop damage) in the United States. The data can be obtained from the website of the U.S. National Oceanic and Atmospheric Administration (NOAA). Because the data was poorly recorded in the first decades, only weather events recorded from 1990 onwards are considered. With regard to human health, tornados, heat, wind, and floods turn out to be the top four events for both fatalities and injuries. Notably, heat (an event with poor media coverage!) leads the table of fatalities by far. With regard to physical damage, tornados, wind, and floods play a substantial role, too. However, hail turns out to be another serious weather event, in particular causing large crop losses.

Data processing

First of all, the data has to be read into R. Then, some packages for the data analysis are loaded.

weather <- read.csv(bzfile("repdata_data_StormData.csv.bz2"))
library(dplyr)
library(ggplot2)

Since data from earlier decades may be somewhat sparse, we first look at the distribution of records over the years. To do so, we have to extract the year from the variable BGN_DATE.

weather$BGN_DATE <- as.character(weather$BGN_DATE)
splitNames <- strsplit(weather$BGN_DATE, "\\ ")
splitNames[1]
## [[1]]
## [1] "4/18/1950" "0:00:00"
splitNames[[1]][[1]]
## [1] "4/18/1950"
firstElement <- function(x){x[1]}
weather$Date <- sapply(splitNames, firstElement)
splitNames <- strsplit(weather$Date, "\\/")
splitNames[[1]][[3]]
## [1] "1950"
ThirdElement <- function(x){x[3]}
weather$Date <- sapply(splitNames, ThirdElement)
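
The same year extraction could also be sketched by parsing the date directly instead of splitting strings twice. The snippet below is only an illustration on a made-up vector `bgn_date` (a hypothetical stand-in for `weather$BGN_DATE`), and it assumes every entry follows the month/day/year layout shown above.

```r
# Hypothetical sample in the same "m/d/Y H:M:S" layout as BGN_DATE
bgn_date <- c("4/18/1950 0:00:00", "6/8/1990 0:00:00", "11/28/2011 0:00:00")

# as.Date() reads only as much of the string as the format needs,
# so the trailing time part is ignored
parsed <- as.Date(bgn_date, format = "%m/%d/%Y")

# Extract the four-digit year as a number
year <- as.numeric(format(parsed, "%Y"))
year
```

Entries that do not fit the assumed layout would come back as `NA`, so a check with `sum(is.na(year))` is advisable before filtering on the result.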

As the histogram shows, the bulk of events has been recorded from the 1990s onwards, so we take these cases and go on with the analysis:

weather$Date <- as.numeric(weather$Date)
hist(weather$Date)
abline(v=median(weather$Date))
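
Alongside the histogram, the choice of cutoff can be backed up with a simple count. The following is a minimal sketch on a made-up vector `years` (hypothetical, standing in for `weather$Date`); the numbers are illustrative and not taken from the NOAA file.

```r
# Hypothetical vector of event years standing in for weather$Date
years <- c(1950, 1975, 1989, 1990, 1995, 1995, 2005, 2011, 2011, 2011)

# Number of records per year
per_year <- table(years)
per_year

# Share of all records that fall in 1990 or later
share_90 <- mean(years >= 1990)
share_90
```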

As we modify the data, the dplyr package comes in very handy. We filter the relevant observations, select the variables of interest and check for NAs.

weather_90 <- weather %>% filter(Date > 1989)
weather_90_healthdamage <- weather_90 %>% select(EVTYPE, FATALITIES, INJURIES, PROPDMG, CROPDMG, Date)
colSums(is.na(weather_90_healthdamage))
##     EVTYPE FATALITIES   INJURIES    PROPDMG    CROPDMG       Date 
##          0          0          0          0          0          0

Looking at the event types and their numbers of occurrences, we find that several of them are redundant (e.g. TSTM and THUNDERSTORM). So before continuing the analysis, we have to merge the redundant event types. However, due to the sheer number of distinct values, we only deal with those that are numerically relevant.

weather_90_healthdamage$EVTYPE <- toupper(weather_90_healthdamage$EVTYPE)
weather_90_healthdamage$EVTYPE <- gsub(".*WIND.*", "WIND", weather_90_healthdamage$EVTYPE)
weather_90_healthdamage$EVTYPE <- gsub(".*HEAT.*", "HEAT", weather_90_healthdamage$EVTYPE)
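# Caution: the spaces around "|" are literal parts of each alternative, so labels
# without those spaces (e.g. "WINTER STORM", "TROPICAL STORM") are left unchanged.
weather_90_healthdamage$EVTYPE <- gsub(".*STORM.* | TSTM.* | THUNDERSTORM.*", "THUNDERSTORM", weather_90_healthdamage$EVTYPE)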
weather_90_healthdamage$EVTYPE <- gsub(".*FLOOD.*", "FLOOD", weather_90_healthdamage$EVTYPE)
weather_90_healthdamage$EVTYPE <- gsub("AVALAN.*", "AVALANCH", weather_90_healthdamage$EVTYPE)
weather_90_healthdamage$EVTYPE <- gsub(".*RIP CURRENTS.*", "RIP CURRENT", weather_90_healthdamage$EVTYPE)
weather_90_healthdamage$EVTYPE <- gsub(".*ICE STORM.*", "WINTER STORM", weather_90_healthdamage$EVTYPE)
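
Whitespace inside a regular expression is matched literally, which matters for alternations written with `|`. The sketch below uses a few made-up labels (hypothetical, not taken from the data) to compare the thunderstorm pattern as written above with a variant without spaces around `|`; the variable names `spaced` and `tight` are introduced here only for illustration.

```r
# Hypothetical event labels, already upper-cased
labels <- c("WINTER STORM", "TSTM HAIL", "SEVERE THUNDERSTORM", "STORM SURGE")

# Alternation with literal spaces around "|", as in the gsub() call above
spaced <- ".*STORM.* | TSTM.* | THUNDERSTORM.*"
# Alternation without the spaces (illustrative variant)
tight <- ".*STORM.*|TSTM.*|THUNDERSTORM.*"

# TRUE where the pattern matches somewhere in the label
matches <- data.frame(
  label  = labels,
  spaced = grepl(spaced, labels),
  tight  = grepl(tight, labels),
  stringsAsFactors = FALSE
)
matches
```

The variant without spaces matches every label containing STORM, including WINTER STORM and TROPICAL STORM, so switching to it would change the totals reported in the results.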

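Again, the dplyr package is useful: we group, summarise and arrange the modified data to get one data frame for each of the four categories (fatalities, injuries, property damage and crop damage). Note that PROPDMG and CROPDMG are summed as recorded; the magnitude codes in PROPDMGEXP and CROPDMGEXP are not applied, so the damage totals are not dollar amounts.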

Fat_Ev <- weather_90_healthdamage %>% group_by (EVTYPE) %>% summarise(Sum = sum(FATALITIES)) %>% arrange(desc(Sum))
Inj_Ev <- weather_90_healthdamage %>% group_by(EVTYPE) %>% summarise(Sum = sum(INJURIES)) %>% arrange(desc(Sum))
Prop_Ev <- weather_90_healthdamage %>% group_by(EVTYPE) %>% summarise(Sum = sum(PROPDMG)) %>% arrange(desc(Sum))
Crop_Ev <- weather_90_healthdamage %>% group_by(EVTYPE) %>% summarise(Sum = sum(CROPDMG)) %>% arrange(desc(Sum))
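
Damage figures in the NOAA file come with magnitude codes in PROPDMGEXP and CROPDMGEXP. As a minimal sketch of how such codes could be converted into dollar amounts before summing, the example below uses a made-up data frame `damage`; the mapping K = thousand, M = million, B = billion and the decision to treat every other code as a factor of 1 are assumptions made for illustration, not part of the analysis above.

```r
# Hypothetical records; PROPDMGEXP holds the magnitude code for PROPDMG
damage <- data.frame(
  EVTYPE     = c("FLOOD", "HAIL", "FLOOD", "TORNADO"),
  PROPDMG    = c(2.5, 500, 1, 10),
  PROPDMGEXP = c("M", "K", "B", ""),
  stringsAsFactors = FALSE
)

# Assumed mapping: K = thousand, M = million, B = billion
multiplier <- c(K = 1e3, M = 1e6, B = 1e9)

# Look up each code; codes missing from the mapping become NA, then 1 (assumption)
mult <- multiplier[toupper(damage$PROPDMGEXP)]
mult[is.na(mult)] <- 1
damage$PROPDMG_USD <- damage$PROPDMG * unname(mult)

# Total damage in dollars per event type, largest first
totals <- sort(tapply(damage$PROPDMG_USD, damage$EVTYPE, sum), decreasing = TRUE)
totals
```

The same steps would apply to CROPDMG and CROPDMGEXP. With scaled values, the ranking of event types can differ from a ranking based on the raw figures.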

Results

Now, we extract the ten event types with the highest totals in each category.

Top_Fat <- top_n(Fat_Ev, 10, Sum)
Top_Inj <- top_n(Inj_Ev, 10, Sum)
Top_Prop <- top_n(Prop_Ev, 10, Sum)
Top_Crop <- top_n(Crop_Ev, 10, Sum)
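
In dplyr 1.0.0 and later, `top_n()` is superseded by `slice_max()`. The following sketch shows the equivalent selection on a small made-up data frame `totals` (hypothetical values); it assumes a dplyr version that provides `slice_max()`.

```r
library(dplyr)

# Hypothetical totals per event type
totals <- data.frame(
  EVTYPE = c("HEAT", "TORNADO", "FLOOD", "WIND", "HAIL"),
  Sum    = c(3138, 1752, 1525, 1274, 15),
  stringsAsFactors = FALSE
)

# Keep the three rows with the largest Sum, ordered from largest to smallest
top3 <- totals %>% slice_max(Sum, n = 3)
top3
```

Unlike `top_n()`, `slice_max()` returns the rows already sorted by the ranking variable, so it does not rely on a preceding `arrange()`.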
Top_Fat
## Source: local data frame [10 x 2]
## 
##          EVTYPE  Sum
## 1          HEAT 3138
## 2       TORNADO 1752
## 3         FLOOD 1525
## 4          WIND 1274
## 5     LIGHTNING  816
## 6   RIP CURRENT  577
## 7  WINTER STORM  295
## 8      AVALANCH  225
## 9  EXTREME COLD  162
## 10   HEAVY SNOW  127
Top_Inj
## Source: local data frame [10 x 2]
## 
##               EVTYPE   Sum
## 1            TORNADO 26674
## 2               WIND  9563
## 3               HEAT  9224
## 4              FLOOD  8604
## 5          LIGHTNING  5230
## 6       WINTER STORM  3311
## 7  HURRICANE/TYPHOON  1275
## 8               HAIL  1139
## 9         HEAVY SNOW  1021
## 10          WILDFIRE   911
Top_Prop
## Source: local data frame [10 x 2]
## 
##            EVTYPE        Sum
## 1            WIND 3134605.26
## 2           FLOOD 2435599.60
## 3         TORNADO 1588732.99
## 4            HAIL  688693.38
## 5       LIGHTNING  603351.78
## 6    WINTER STORM  199821.26
## 7      HEAVY SNOW  122251.99
## 8        WILDFIRE   84459.34
## 9      HEAVY RAIN   50842.14
## 10 TROPICAL STORM   48423.68
Top_Crop
## Source: local data frame [10 x 2]
## 
##            EVTYPE       Sum
## 1            HAIL 579596.28
## 2           FLOOD 364343.93
## 3            WIND 223939.84
## 4         TORNADO 100018.52
## 5         DROUGHT  33898.62
## 6      HEAVY RAIN  11122.80
## 7    FROST/FREEZE   7134.14
## 8    EXTREME COLD   6141.14
## 9  TROPICAL STORM   5899.12
## 10      HURRICANE   5339.31

We make two plots to illustrate our findings. First, we add a category column to each data frame; then, we combine the four data frames into two with rbind.

Top_Fat$Category <- "Fatalities"
Top_Inj$Category <- "Injuries"
Top_Prop$Category <- "Property damage" 
Top_Crop$Category <- "Crop damage"
Top_Fat_Inj <- rbind(Top_Fat, Top_Inj)
Top_Prop_Crop <- rbind(Top_Prop, Top_Crop)

Finally, using the ggplot2 package, we obtain the two plots: one for harm to people, expressed in injuries and fatalities, and one for physical damage, expressed in property damage and crop damage.

g <- ggplot(Top_Fat_Inj, aes(x= reorder(EVTYPE, -Sum), Sum))
g <- g + geom_bar(stat="identity") + facet_grid(.~Category, scales = "free") + theme(axis.text.x = element_text(angle=90, hjust=1, size=8)) + xlab("Event")
g

f <- ggplot(Top_Prop_Crop, aes(x= reorder(EVTYPE, -Sum), Sum))
f <- f + geom_bar(stat="identity") + facet_grid(.~Category, scales = "free") + theme(axis.text.x = element_text(angle=90, hjust=1, size=8)) + xlab("Event")
f
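
One detail of `reorder()` is worth keeping in mind when reading the faceted plots: it computes a single ordering per event type across the whole data frame (by default the mean of the second argument), not one ordering per facet. The sketch below illustrates this with the HEAT and TORNADO totals from the tables above; the data frame `both` is constructed here only for illustration.

```r
# Fatality and injury totals for two event types, taken from the tables above
both <- data.frame(
  EVTYPE   = c("HEAT", "TORNADO", "TORNADO", "HEAT"),
  Sum      = c(3138, 1752, 26674, 9224),
  Category = c("Fatalities", "Fatalities", "Injuries", "Injuries"),
  stringsAsFactors = FALSE
)

# reorder() averages -Sum per event type over both categories
ord <- reorder(both$EVTYPE, -both$Sum)
levels(ord)
```

Because TORNADO has the larger combined mean, it is placed first in both panels, even though HEAT has more fatalities. The bars within a panel are therefore not necessarily in descending order.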