Severe weather events such as tornadoes and hurricanes cause many fatalities, injuries, and much property damage every year, while floods often result in substantial losses for the crop industry. Investigating which weather events cause the worst damage may help in allocating public money as effectively as possible.
The goal of this analysis is to shed some light on the question of which weather events have the worst impact on human health (fatalities and injuries) and cause the most physical damage (property damage and crop damage) in the United States. The data can be obtained from the website of the U.S. National Oceanic and Atmospheric Administration (NOAA). Because the data was poorly recorded in the first decades, we consider only weather events recorded from 1990 onwards. With regard to human health, tornadoes, heat, wind, and floods are the top four events in both fatalities and injuries. Notably, heat (an event with poor media coverage!) leads the table of fatalities by a wide margin. With regard to physical damage, tornadoes, wind, and floods also play a substantial role. In addition, hail turns out to be a serious weather event, ranking first in crop damage.
First, the data is read into R and the packages needed for the analysis are loaded.
weather <- read.csv(bzfile("repdata_data_StormData.csv.bz2"))
library(dplyr)
library(ggplot2)
Because data from earlier decades may be sparse, we first looked at how the records are distributed over the years. To do so, we had to extract the year from the BGN_DATE column.
weather$BGN_DATE <- as.character(weather$BGN_DATE)
splitNames <- strsplit(weather$BGN_DATE, "\\ ")
splitNames[1]
## [[1]]
## [1] "4/18/1950" "0:00:00"
splitNames[[1]][[1]]
## [1] "4/18/1950"
firstElement <- function(x){x[1]}
weather$Date <- sapply(splitNames, firstElement)
splitNames <- strsplit(weather$Date, "\\/")
splitNames[[1]][[3]]
## [1] "1950"
ThirdElement <- function(x){x[3]}
weather$Date <- sapply(splitNames, ThirdElement)
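The two rounds of strsplit above can be condensed. As a minimal sketch, assuming every BGN_DATE value follows the month/day/year layout seen in the first record, base R can parse the string and pull out the year directly. The vector `bgn` is made-up sample input, not data taken from the file.

```r
# Hypothetical sample values in the same layout as BGN_DATE
bgn <- c("4/18/1950 0:00:00", "11/15/1993 0:00:00", "6/2/2011 0:00:00")

# as.Date() parses the leading date; the trailing time part is ignored
parsed <- as.Date(bgn, format = "%m/%d/%Y")

# format(..., "%Y") gives the four-digit year as text, which we turn into a number
years <- as.numeric(format(parsed, "%Y"))
years
```

This also yields a numeric vector straight away, so the later as.numeric() step would not be needed with this approach.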
A histogram of the extracted years shows that the bulk of events has been recorded from the 1990s onwards, so we keep these cases and continue the analysis:
weather$Date <- as.numeric(weather$Date)
hist(weather$Date)
abline(v=median(weather$Date))
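To put a number on that impression, one could compute the share of records that fall after the cutoff. The sketch below uses a short made-up vector `yrs` in place of weather$Date, so the resulting share says nothing about the real data.

```r
# Made-up years standing in for weather$Date
yrs <- c(1950, 1975, 1988, 1992, 1998, 2005, 2011, 2011)

# A logical vector averages to the proportion of TRUE values
share_kept <- mean(yrs > 1989)
share_kept
```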
For modifying the data, the dplyr package comes in very handy. We filter the relevant observations, select the variables of interest, and check for NAs.
weather_90 <- weather %>% filter(Date > 1989)
weather_90_healthdamage <- weather_90 %>% select(EVTYPE, FATALITIES, INJURIES, PROPDMG, CROPDMG, Date)
colSums(is.na(weather_90_healthdamage))
## EVTYPE FATALITIES INJURIES PROPDMG CROPDMG Date
## 0 0 0 0 0 0
Looking at the event types and how often each occurs, we see that several of them are redundant (e.g. TSTM and THUNDERSTORM). So before continuing the analysis, we have to merge the redundant event types. However, due to the sheer number of distinct values, we decided to handle only those that are numerically relevant.
# Normalise case first so the patterns only need upper-case forms
weather_90_healthdamage$EVTYPE <- toupper(weather_90_healthdamage$EVTYPE)
# Collapse every label that contains WIND or HEAT into a single event type
weather_90_healthdamage$EVTYPE <- gsub(".*WIND.*", "WIND", weather_90_healthdamage$EVTYPE)
weather_90_healthdamage$EVTYPE <- gsub(".*HEAT.*", "HEAT", weather_90_healthdamage$EVTYPE)
# Note: the spaces around | are literal, so only labels containing those spaces match
weather_90_healthdamage$EVTYPE <- gsub(".*STORM.* | TSTM.* | THUNDERSTORM.*", "THUNDERSTORM", weather_90_healthdamage$EVTYPE)
weather_90_healthdamage$EVTYPE <- gsub(".*FLOOD.*", "FLOOD", weather_90_healthdamage$EVTYPE)
# Merge spelling variants and plural forms
weather_90_healthdamage$EVTYPE <- gsub("AVALAN.*", "AVALANCH", weather_90_healthdamage$EVTYPE)
weather_90_healthdamage$EVTYPE <- gsub(".*RIP CURRENTS.*", "RIP CURRENT", weather_90_healthdamage$EVTYPE)
# Count ice storms as winter storms
weather_90_healthdamage$EVTYPE <- gsub(".*ICE STORM.*", "WINTER STORM", weather_90_healthdamage$EVTYPE)
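Whitespace inside a regular expression is matched literally, which matters for the alternation in the thunderstorm pattern above. The sketch below contrasts a spaced alternation with a grouped one on four made-up labels. The `tight` pattern is only an illustration and not the pattern used in this analysis; switching to it would change the event counts and therefore the tables that follow.

```r
# Made-up labels for illustration
ev <- c("TSTM WIND", "THUNDERSTORM WINDS", "WINTER STORM", "HAIL")

# Spaced alternation: "TSTM.* " needs a trailing space,
# " THUNDERSTORM.*" needs a leading space
spaced <- gsub("TSTM.* | THUNDERSTORM.*", "THUNDERSTORM", ev)

# Grouped alternation anchored at the start of the label, no stray spaces
tight <- gsub("^(TSTM|THUNDERSTORM).*", "THUNDERSTORM", ev)

data.frame(ev, spaced, tight)
```

With the spaced version, a label such as "THUNDERSTORM WINDS" is left untouched because it has no leading space, while the grouped version rewrites both thunderstorm spellings and leaves "WINTER STORM" alone.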
Again, the dplyr package comes in handy: we group, summarise, and arrange the modified data to get one data frame for each of the four categories (fatalities, injuries, property damage, and crop damage).
Fat_Ev <- weather_90_healthdamage %>% group_by(EVTYPE) %>% summarise(Sum = sum(FATALITIES)) %>% arrange(desc(Sum))
Inj_Ev <- weather_90_healthdamage %>% group_by(EVTYPE) %>% summarise(Sum = sum(INJURIES)) %>% arrange(desc(Sum))
Prop_Ev <- weather_90_healthdamage %>% group_by(EVTYPE) %>% summarise(Sum = sum(PROPDMG)) %>% arrange(desc(Sum))
Crop_Ev <- weather_90_healthdamage %>% group_by(EVTYPE) %>% summarise(Sum = sum(CROPDMG)) %>% arrange(desc(Sum))
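The four pipelines above differ only in the column being summed, so they could be wrapped in one function. This is a sketch rather than the code used here: `sum_by_event` is a hypothetical helper, and `toy` is a made-up data frame that only borrows the column names.

```r
library(dplyr)

# Made-up rows that reuse the column names of the analysis
toy <- data.frame(
  EVTYPE = c("HEAT", "FLOOD", "HEAT", "TORNADO", "FLOOD"),
  FATALITIES = c(3, 1, 2, 4, 0),
  INJURIES = c(10, 2, 5, 30, 1),
  stringsAsFactors = FALSE
)

# Hypothetical helper: total of one column per event type, largest first.
# `column` is the column name as a string; .data[[column]] looks it up.
sum_by_event <- function(data, column) {
  data %>%
    group_by(EVTYPE) %>%
    summarise(Sum = sum(.data[[column]])) %>%
    arrange(desc(Sum))
}

fat <- sum_by_event(toy, "FATALITIES")
inj <- sum_by_event(toy, "INJURIES")
fat
```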
Next, we extract the ten event types with the highest totals in each category.
Top_Fat <- top_n(Fat_Ev, 10, Sum)
Top_Inj <- top_n(Inj_Ev, 10, Sum)
Top_Prop <- top_n(Prop_Ev, 10, Sum)
Top_Crop <- top_n(Crop_Ev, 10, Sum)
Top_Fat
## Source: local data frame [10 x 2]
##
## EVTYPE Sum
## 1 HEAT 3138
## 2 TORNADO 1752
## 3 FLOOD 1525
## 4 WIND 1274
## 5 LIGHTNING 816
## 6 RIP CURRENT 577
## 7 WINTER STORM 295
## 8 AVALANCH 225
## 9 EXTREME COLD 162
## 10 HEAVY SNOW 127
Top_Inj
## Source: local data frame [10 x 2]
##
## EVTYPE Sum
## 1 TORNADO 26674
## 2 WIND 9563
## 3 HEAT 9224
## 4 FLOOD 8604
## 5 LIGHTNING 5230
## 6 WINTER STORM 3311
## 7 HURRICANE/TYPHOON 1275
## 8 HAIL 1139
## 9 HEAVY SNOW 1021
## 10 WILDFIRE 911
Top_Prop
## Source: local data frame [10 x 2]
##
## EVTYPE Sum
## 1 WIND 3134605.26
## 2 FLOOD 2435599.60
## 3 TORNADO 1588732.99
## 4 HAIL 688693.38
## 5 LIGHTNING 603351.78
## 6 WINTER STORM 199821.26
## 7 HEAVY SNOW 122251.99
## 8 WILDFIRE 84459.34
## 9 HEAVY RAIN 50842.14
## 10 TROPICAL STORM 48423.68
Top_Crop
## Source: local data frame [10 x 2]
##
## EVTYPE Sum
## 1 HAIL 579596.28
## 2 FLOOD 364343.93
## 3 WIND 223939.84
## 4 TORNADO 100018.52
## 5 DROUGHT 33898.62
## 6 HEAVY RAIN 11122.80
## 7 FROST/FREEZE 7134.14
## 8 EXTREME COLD 6141.14
## 9 TROPICAL STORM 5899.12
## 10 HURRICANE 5339.31
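In dplyr 1.0.0 and later, top_n() is superseded by slice_max(), which takes the ordering column and the number of rows as named arguments. A minimal sketch on a made-up data frame `toy`, assuming a dplyr version that provides slice_max():

```r
library(dplyr)

# Made-up totals for illustration
toy <- data.frame(
  EVTYPE = c("A", "B", "C", "D"),
  Sum = c(5, 9, 1, 7),
  stringsAsFactors = FALSE
)

# Keep the two rows with the largest Sum; rows come back sorted, largest first
top2 <- slice_max(toy, Sum, n = 2)
top2
```

Like top_n(), slice_max() keeps tied rows by default, so the result can contain more than n rows when totals are equal.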
We make two plots to illustrate our findings. First, we add a category column to each data frame; then, we rbind the four data frames into two.
Top_Fat$Category <- "Fatalities"
Top_Inj$Category <- "Injuries"
Top_Prop$Category <- "Property damage"
Top_Crop$Category <- "Crop damage"
Top_Fat_Inj <- rbind(Top_Fat, Top_Inj)
Top_Prop_Crop <- rbind(Top_Prop, Top_Crop)
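Labelling and stacking can also be done in one step. As a sketch, dplyr's bind_rows() accepts a named list and writes the list names into the column given by `.id`. The data frames `fat` and `inj` below are made-up stand-ins for the top-ten tables.

```r
library(dplyr)

# Made-up stand-ins for two of the top-ten tables
fat <- data.frame(EVTYPE = c("HEAT", "TORNADO"), Sum = c(5, 4), stringsAsFactors = FALSE)
inj <- data.frame(EVTYPE = c("TORNADO", "WIND"), Sum = c(30, 12), stringsAsFactors = FALSE)

# The list names become the values of the new Category column
both <- bind_rows(list(Fatalities = fat, Injuries = inj), .id = "Category")
both
```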
Finally, using the ggplot2 package, we obtain the two plots: one for personal damage expressed in injuries and fatalities, and one for physical damage expressed in property damage and crop damage.
g <- ggplot(Top_Fat_Inj, aes(x= reorder(EVTYPE, -Sum), Sum))
g <- g + geom_bar(stat="identity") + facet_grid(.~Category, scales = "free") + theme(axis.text.x = element_text(angle=90, hjust=1, size=8)) + xlab("Event")
g
f <- ggplot(Top_Prop_Crop, aes(x= reorder(EVTYPE, -Sum), Sum))
f <- f + geom_bar(stat="identity") + facet_grid(.~Category, scales = "free") + theme(axis.text.x = element_text(angle=90, hjust=1, size=8)) + xlab("Event")
f
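One detail of the plots above is that reorder(EVTYPE, -Sum) ranks each event by its average over both facets, so the bars inside a single facet are not guaranteed to descend. The sketch below shows one possible workaround, not the method used for the figures: it builds a helper column `key` (a hypothetical name) that is unique per category and event, orders by that, and strips the prefix again for the axis labels. The four values are taken from the fatality and injury tables above.

```r
library(ggplot2)

# Four values copied from the Top_Fat and Top_Inj tables
toy <- data.frame(
  EVTYPE = c("HEAT", "TORNADO", "TORNADO", "HEAT"),
  Sum = c(3138, 1752, 26674, 9224),
  Category = c("Fatalities", "Fatalities", "Injuries", "Injuries"),
  stringsAsFactors = FALSE
)

# One key per (category, event) pair, so each facet gets its own ordering
toy$key <- paste(toy$Category, toy$EVTYPE, sep = "__")
toy$key <- reorder(toy$key, -toy$Sum)

p <- ggplot(toy, aes(x = key, y = Sum)) +
  geom_col() +  # same bars as geom_bar(stat = "identity")
  facet_grid(. ~ Category, scales = "free_x") +
  # Show only the event name on the axis, without the category prefix
  scale_x_discrete(labels = function(k) sub("^.*__", "", k)) +
  xlab("Event")
p
```

With this ordering, HEAT comes before TORNADO in the fatalities facet and TORNADO before HEAT in the injuries facet, matching the tables.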