Synopsis

This analysis uses data from the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. The database records characteristics of major storms and weather events in the United States, including when and where they occurred and estimates of any fatalities, injuries, and property damage. More information about the data can be found in this document: Storm Data Documentation

We examine which types of weather events are most harmful with respect to population health and which types have the greatest economic consequences. The data describes events from 1950 to 2011. Since considerably fewer weather events were recorded in the earlier years, we limit our analysis to the years with at least 20,000 recorded events, which are the years 1994 to 2011.

The next section, Data Processing, describes how the raw data was processed for analysis. The results of this analysis can be found in the Results section of this paper.

Data Processing

This section describes the processing of the raw data and the steps of the analysis in detail.
First, some packages need to be loaded:

library(lubridate)
library(ggplot2)
library(dplyr)

Also, the data is loaded from a compressed csv.bz2 file:

data <- read.csv("repdata-data-StormData.csv.bz2", stringsAsFactors = FALSE)

Let’s see how much data we have:

dim(data)
## [1] 902297     37

Let’s take a look at the 37 columns:

names(data)
##  [1] "STATE__"    "BGN_DATE"   "BGN_TIME"   "TIME_ZONE"  "COUNTY"    
##  [6] "COUNTYNAME" "STATE"      "EVTYPE"     "BGN_RANGE"  "BGN_AZI"   
## [11] "BGN_LOCATI" "END_DATE"   "END_TIME"   "COUNTY_END" "COUNTYENDN"
## [16] "END_RANGE"  "END_AZI"    "END_LOCATI" "LENGTH"     "WIDTH"     
## [21] "F"          "MAG"        "FATALITIES" "INJURIES"   "PROPDMG"   
## [26] "PROPDMGEXP" "CROPDMG"    "CROPDMGEXP" "WFO"        "STATEOFFIC"
## [31] "ZONENAMES"  "LATITUDE"   "LONGITUDE"  "LATITUDE_E" "LONGITUDE_"
## [36] "REMARKS"    "REFNUM"

We know we have data from the years 1950 to 2011. To see whether it makes sense to include all years or only the more recent ones, we look at how many events were recorded in each year. The date of each event is stored in BGN_DATE.

years <- year(mdy_hms(data$BGN_DATE)) 
table(years)
## years
##  1950  1951  1952  1953  1954  1955  1956  1957  1958  1959  1960  1961 
##   223   269   272   492   609  1413  1703  2184  2213  1813  1945  2246 
##  1962  1963  1964  1965  1966  1967  1968  1969  1970  1971  1972  1973 
##  2389  1968  2348  2855  2388  2688  3312  2926  3215  3471  2168  4463 
##  1974  1975  1976  1977  1978  1979  1980  1981  1982  1983  1984  1985 
##  5386  4975  3768  3728  3657  4279  6146  4517  7132  8322  7335  7979 
##  1986  1987  1988  1989  1990  1991  1992  1993  1994  1995  1996  1997 
##  8726  7367  7257 10410 10946 12522 13534 12607 20631 27970 32270 28680 
##  1998  1999  2000  2001  2002  2003  2004  2005  2006  2007  2008  2009 
## 38128 31289 34471 34962 36293 39752 39363 39184 44034 43289 55663 45817 
##  2010  2011 
## 48161 62174
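The year extraction can also be sketched in base R, without lubridate. This is a minimal sketch, assuming the date strings follow the month/day/year hour:minute:second layout that mdy_hms() implies. The two sample strings and the name `sample_dates` are made up for illustration.

```r
# Hypothetical sample strings in month/day/year hour:minute:second layout
sample_dates <- c("4/18/1950 0:00:00", "11/28/2011 0:00:00")

# Parse with an explicit format, then pull out the year as an integer
parsed <- as.POSIXct(sample_dates, format = "%m/%d/%Y %H:%M:%S", tz = "UTC")
yrs <- as.integer(format(parsed, "%Y"))
print(yrs)
```

Strings that do not match the format come back as NA, which makes malformed dates easy to spot.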

So for the earlier years we have very little information about events compared to the more recent years. We decide to consider only the years 1994 to 2011 for this analysis, which are the years with 20,000 events or more. We also won’t need all 37 columns of the original dataset, so next we create the corresponding subset with the data we intend to use for the analysis.

events <- subset(data, years %in% 1994:2011, select = c(EVTYPE, FATALITIES, INJURIES, PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP))
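The year range could also be derived from the threshold instead of being typed in by hand. The following is a minimal sketch using a few counts copied from the table above. The names `counts`, `min_events`, and `keep_years` are hypothetical. With the full data, `counts` would presumably be `table(years)`.

```r
# Event counts for a few years, copied from the table above
counts <- c("1992" = 13534, "1993" = 12607, "1994" = 20631,
            "1995" = 27970, "2011" = 62174)

# Cutoff used in this analysis
min_events <- 20000

# Years whose event count reaches the cutoff
keep_years <- as.integer(names(counts)[counts >= min_events])
print(keep_years)
```

Deriving the years this way keeps the subset consistent with the stated cutoff if the threshold is changed later.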

The columns FATALITIES and INJURIES contain information about the impact on population health. PROPDMG and CROPDMG contain information about property damage and crop damage, respectively. PROPDMGEXP and CROPDMGEXP give the magnitude: “K” for thousands, “M” for millions, and “B” for billions. Unfortunately, these columns also contain other entries, both numerical and alphabetical. For the economic impact analysis we decided to ignore events with invalid entries in these columns. We also ignore the impact if the magnitude column is empty, as we are looking for events with high impacts.

## Keep only the valid magnitude codes; everything else becomes "0"
events$PROPDMGEXP <- ifelse(events$PROPDMGEXP %in% c("K", "M", "B"), events$PROPDMGEXP, "0")
events$PROPDMGEXP <- factor(events$PROPDMGEXP, levels = c("0", "B", "K", "M"), labels = c(0, 1e+09, 1e+03, 1e+06)) ## explicit levels instead of relying on alphabetical order

events$CROPDMGEXP <- ifelse(events$CROPDMGEXP %in% c("K", "M", "B"), events$CROPDMGEXP, "0")
events$CROPDMGEXP <- factor(events$CROPDMGEXP, levels = c("0", "B", "K", "M"), labels = c(0, 1e+09, 1e+03, 1e+06)) ## explicit levels instead of relying on alphabetical order

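## Multiply economic impacts by their magnitude.
## as.numeric() on a factor returns the level codes (1 to 4), so convert via as.character() first.

events$PROPDMG <- events$PROPDMG * as.numeric(as.character(events$PROPDMGEXP))
events$CROPDMG <- events$CROPDMG * as.numeric(as.character(events$CROPDMGEXP))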
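An alternative way to turn magnitude codes into multipliers is a named lookup vector, which involves no factors at all. The following is a minimal sketch on toy data. The data frame `toy`, the vector `exp_multiplier`, and all values are made up for illustration and are not taken from the storm database.

```r
# Toy data (made up for illustration)
toy <- data.frame(
  PROPDMG    = c(2.5, 10, 1, 7, 3),
  PROPDMGEXP = c("K", "M", "B", "", "h"),
  stringsAsFactors = FALSE
)

# Hypothetical lookup vector: magnitude code -> multiplier
exp_multiplier <- c(K = 1e3, M = 1e6, B = 1e9)

# Look up each code; codes not in the vector (including "") yield NA
mult <- unname(exp_multiplier[toy$PROPDMGEXP])

# Treat empty and invalid codes as zero impact, as in the analysis
mult[is.na(mult)] <- 0

toy$damage <- toy$PROPDMG * mult
print(toy)
```

Because the lookup returns plain numbers, the multiplication needs no conversion step, and adding another code only requires one more entry in the vector.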

We group this dataset by type of event to sum up the impact.

byEvent <- group_by(events, EVTYPE)
impact <- summarize(byEvent, human = sum(FATALITIES) + sum(INJURIES), 
                            economic = sum(PROPDMG) + sum(CROPDMG))

Sort by human impact and take the first 10 entries:

human_sort <- head(impact[order(-impact$human), ], 10)

Sort by economic impact and take the first 10 entries:

economic_sort <- head(impact[order(-impact$economic), ], 10)

Results

g <- ggplot(human_sort, aes(EVTYPE, human))
p <- g + geom_bar(aes(fill = EVTYPE), stat = "identity") + 
          labs(title = "Types of Events with Biggest Impact on Humans \n") + 
          labs(x = "Type of Event", y = "Impact (Sum of Injuries and Fatalities)")
print(p)

Figure: bar chart of the ten event types with the biggest impact on humans (sum of injuries and fatalities)

g <- ggplot(economic_sort, aes(EVTYPE, economic)) ## use the top 10 by economic impact, not human_sort
p <- g + geom_bar(aes(fill = EVTYPE), stat = "identity") + 
          labs(title = "Types of Events with Biggest Economic Impact \n") + 
          labs(x = "Type of Event", y = "Impact (Sum of Damages to Properties and Crops in US$)")
print(p)

Figure: bar chart of the ten event types with the biggest economic impact (sum of damages to properties and crops in US$)

The first plot shows the ten types of severe weather events that have the biggest impact on human health according to the records from 1994 to 2011. Tornadoes are by far the event type that causes the most injuries and fatalities, followed by excessive heat and floods.

The second plot displays the ten types of severe weather events that have the biggest economic impact. Here we see that flash floods cause the biggest property and crop damage, closely followed by tornadoes and thunderstorms. Looking at the different event types in the plot, we can see that thunderstorm winds actually occur twice, under different names (TSTM WIND and THUNDERSTORM WIND). This is because we left the event types as they were defined in the raw dataset. It could also make sense to group flash floods and floods together. With these alterations, floods would still be the event type causing the biggest damage, followed by thunderstorms and tornadoes.
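Grouping such event types could be sketched as follows. This is a minimal illustration on toy data, not the analysis itself. The table `toy_impact`, its numbers, and the mapping `merge_map` are made up, and which names should be merged is an assumption based on the two cases mentioned above.

```r
# Toy impact table (numbers are made up for illustration)
toy_impact <- data.frame(
  EVTYPE   = c("TSTM WIND", "THUNDERSTORM WIND", "FLASH FLOOD", "FLOOD", "TORNADO"),
  economic = c(10, 20, 30, 40, 50),
  stringsAsFactors = FALSE
)

# Hypothetical mapping from raw event names to merged names
merge_map <- c("TSTM WIND" = "THUNDERSTORM WIND", "FLASH FLOOD" = "FLOOD")

# Replace names that appear in the mapping, leave all others unchanged
hit <- toy_impact$EVTYPE %in% names(merge_map)
toy_impact$EVTYPE[hit] <- unname(merge_map[toy_impact$EVTYPE[hit]])

# Sum the impact again per merged event type and sort in descending order
merged <- aggregate(economic ~ EVTYPE, data = toy_impact, FUN = sum)
merged <- merged[order(-merged$economic), ]
print(merged)
```

In the real analysis the renaming would have to happen before the events are grouped and summed, so that the merged types are ranked together.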