This analysis uses data from the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. The database records the characteristics of major storms and weather events in the United States, including when and where they occurred and estimates of any fatalities, injuries, and property damage. More information about the data can be found in this document: Storm Data Documentation
We examine which types of weather events are most harmful to population health and which types have the greatest economic consequences. The data describes events from 1950 to 2011. Since considerably fewer weather events were recorded in the earlier years, we limit our analysis to years with at least 20,000 recorded events, which leaves the years 1994 to 2011.
The next section, Data Processing, describes how the raw data was processed for analysis. The results of this analysis can be found in the Results section of this paper.
This section describes the processing of the raw data and the steps of the analysis in detail.
First, some packages need to be loaded:
library(lubridate)
library(ggplot2)
library(dplyr)
Also, the data is loaded from a compressed csv.bz2 file:
data <- read.csv("repdata-data-StormData.csv.bz2", stringsAsFactors = FALSE)
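read.csv() can read a bzip2-compressed file without a separate decompression step, because R's file connections detect the compression. A minimal sketch on a throwaway file (the temporary file and its two lines of content are made up for illustration and are not part of the storm data):

```r
# Write a tiny bzip2-compressed CSV to a temporary location (illustrative content only)
tmp <- tempfile(fileext = ".csv.bz2")
con <- bzfile(tmp, "w")
writeLines(c("EVTYPE,FATALITIES", "TORNADO,3"), con)
close(con)

# read.csv() opens the path through file(), which detects the bzip2 compression
df <- read.csv(tmp, stringsAsFactors = FALSE)
str(df)

unlink(tmp)  # clean up the temporary file
```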
Let’s see how much data we have:
dim(data)
## [1] 902297 37
Let’s take a look at the 37 columns:
names(data)
## [1] "STATE__" "BGN_DATE" "BGN_TIME" "TIME_ZONE" "COUNTY"
## [6] "COUNTYNAME" "STATE" "EVTYPE" "BGN_RANGE" "BGN_AZI"
## [11] "BGN_LOCATI" "END_DATE" "END_TIME" "COUNTY_END" "COUNTYENDN"
## [16] "END_RANGE" "END_AZI" "END_LOCATI" "LENGTH" "WIDTH"
## [21] "F" "MAG" "FATALITIES" "INJURIES" "PROPDMG"
## [26] "PROPDMGEXP" "CROPDMG" "CROPDMGEXP" "WFO" "STATEOFFIC"
## [31] "ZONENAMES" "LATITUDE" "LONGITUDE" "LATITUDE_E" "LONGITUDE_"
## [36] "REMARKS" "REFNUM"
We know we have data from the years 1950 to 2011. To decide whether to include all years or only the more recent ones, we count the events recorded per year. The date of each event is stored in BGN_DATE.
years <- year(mdy_hms(data$BGN_DATE))
table(years)
## years
## 1950 1951 1952 1953 1954 1955 1956 1957 1958 1959 1960 1961
## 223 269 272 492 609 1413 1703 2184 2213 1813 1945 2246
## 1962 1963 1964 1965 1966 1967 1968 1969 1970 1971 1972 1973
## 2389 1968 2348 2855 2388 2688 3312 2926 3215 3471 2168 4463
## 1974 1975 1976 1977 1978 1979 1980 1981 1982 1983 1984 1985
## 5386 4975 3768 3728 3657 4279 6146 4517 7132 8322 7335 7979
## 1986 1987 1988 1989 1990 1991 1992 1993 1994 1995 1996 1997
## 8726 7367 7257 10410 10946 12522 13534 12607 20631 27970 32270 28680
## 1998 1999 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009
## 38128 31289 34471 34962 36293 39752 39363 39184 44034 43289 55663 45817
## 2010 2011
## 48161 62174
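The year can also be extracted with base R alone, which may be useful when lubridate is not available. This is only a sketch: the two sample strings are assumed to follow the month/day/year-plus-time layout that mdy_hms() expects and are not taken from the dataset.

```r
# Illustrative date strings in the assumed BGN_DATE layout
sample_dates <- c("4/18/1950 0:00:00", "11/28/2011 0:00:00")

# as.Date() parses only as much of the string as the format needs;
# the trailing time component is ignored
parsed <- as.Date(sample_dates, format = "%m/%d/%Y")
sample_years <- as.integer(format(parsed, "%Y"))
sample_years
```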
So for the earlier years we have very little information about events compared to the more recent years. We decide to consider only the years 1994 to 2011 for this analysis, which are the years with 20,000 events or more. We also won’t need all 37 columns of the original dataset, so next we create the corresponding subset with the data we intend to use for the analysis.
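The year range can also be derived from the threshold instead of being read off the table by eye. A small sketch, using a handful of the yearly counts from the table above (the names counts, threshold, and keep_years are illustrative):

```r
# A few yearly event counts copied from the table above
counts <- c(`1992` = 13534, `1993` = 12607, `1994` = 20631,
            `1995` = 27970, `2011` = 62174)

threshold <- 20000

# Keep the years whose count reaches the threshold
keep_years <- as.integer(names(counts)[counts >= threshold])
keep_years
```

Applied to the full output of table(years), the same expression would give the vector to use in place of the hard-coded 1994:2011.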
events <- subset(data, years %in% 1994:2011, select = c(EVTYPE, FATALITIES, INJURIES, PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP))
The columns FATALITIES and INJURIES contain information about the impact on population health. PROPDMG and CROPDMG contain information about property damage and crop damage respectively. PROPDMGEXP and CROPDMGEXP signify the magnitude: “K” for thousands, “M” for millions, and “B” for billions. Unfortunately there are other entries, numerical and alphabetical, in these columns. For the economic impact analysis we decided to ignore events with invalid entries in these columns. We also ignore the impact if the magnitude column is empty, as we are looking for events with high impacts.
## Keep valid magnitude codes; anything else (including empty) becomes "0"
## Explicit levels keep the labels aligned even if a code is absent from the data
events$PROPDMGEXP <- ifelse(events$PROPDMGEXP %in% c("K", "M", "B"), events$PROPDMGEXP, "0")
events$PROPDMGEXP <- factor(events$PROPDMGEXP, levels = c("0", "B", "K", "M"), labels = c(0, 1e+09, 1e+03, 1e+06))
events$CROPDMGEXP <- ifelse(events$CROPDMGEXP %in% c("K", "M", "B"), events$CROPDMGEXP, "0")
events$CROPDMGEXP <- factor(events$CROPDMGEXP, levels = c("0", "B", "K", "M"), labels = c(0, 1e+09, 1e+03, 1e+06))
## Multiply economic impacts by their magnitude
## as.numeric() on a factor returns the level codes (1 to 4), so convert via as.character() first
events$PROPDMG <- events$PROPDMG * as.numeric(as.character(events$PROPDMGEXP))
events$CROPDMG <- events$CROPDMG * as.numeric(as.character(events$CROPDMGEXP))
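The magnitude step can also be written with a named lookup vector, which avoids factors altogether. The sketch below runs on made-up values (exp_codes and damage are hypothetical, not taken from the dataset) and also shows why calling as.numeric() directly on a factor is risky:

```r
# Hypothetical magnitude codes and damage figures for illustration
exp_codes <- c("K", "M", "B", "", "5", "h")
damage    <- c(25, 2.5, 1.2, 10, 3, 7)

# Valid codes map to a multiplier; everything else yields NA and is set to 0
multiplier <- c(K = 1e3, M = 1e6, B = 1e9)
mult <- unname(multiplier[exp_codes])
mult[is.na(mult)] <- 0

damage_usd <- damage * mult
damage_usd

# Pitfall: as.numeric() on a factor returns the internal level codes
f <- factor(c("1000", "1e+06"))
as.numeric(f)                # level codes, not the multipliers
as.numeric(as.character(f))  # the numeric values of the labels
```

With this mapping, records with an empty or invalid code contribute zero damage, which matches the decision described above to ignore them.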
We group this dataset by type of event to sum up the impact.
byEvent <- group_by(events, EVTYPE)
impact <- summarize(byEvent, human = sum(FATALITIES) + sum(INJURIES),
economic = sum(PROPDMG) + sum(CROPDMG))
Sort for human impact and take first 10 entries:
human_sort <- head(impact[order(-impact$human), ], 10)
Sort for economic impact and take first 10 entries:
economic_sort <- head(impact[order(-impact$economic), ], 10)
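To make the grouping and sorting steps concrete, here is a sketch of the same idea in base R on a toy data frame. The rows are invented for illustration, and aggregate() stands in for the group_by()/summarize() pair used above:

```r
# Toy records (invented values, not from the storm database)
toy <- data.frame(
  EVTYPE     = c("TORNADO", "FLOOD", "TORNADO", "HEAT"),
  FATALITIES = c(1, 0, 2, 4),
  INJURIES   = c(10, 3, 5, 0),
  stringsAsFactors = FALSE
)

# Human impact per record, then summed per event type
toy$human  <- toy$FATALITIES + toy$INJURIES
impact_toy <- aggregate(human ~ EVTYPE, data = toy, FUN = sum)

# Sort in descending order and keep the top entries
top_toy <- head(impact_toy[order(-impact_toy$human), ], 2)
top_toy
```

Here TORNADO ends up first with 18 (1 + 10 + 2 + 5), followed by HEAT with 4.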
g <- ggplot(human_sort, aes(EVTYPE, human))
p <- g + geom_bar(aes(fill = EVTYPE), stat = "identity") +
labs(title = "Types of Events with Biggest Impact on Humans \n") +
labs(x = "Type of Event", y = "Impact (Sum of Injuries and Fatalities)")
print(p)
g <- ggplot(economic_sort, aes(EVTYPE, economic))
p <- g + geom_bar(aes(fill = EVTYPE), stat = "identity") +
labs(title = "Types of Events with Biggest Economic Impact \n") +
labs(x = "Type of Event", y = "Impact (Sum of Damages to Properties and Crops in US$)")
print(p)
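By default ggplot2 arranges a character x axis alphabetically, so the bars do not appear in order of impact. One possible refinement, sketched on invented values, is to reorder the event types by impact before plotting. The data frame top is hypothetical, and the plot is only built if ggplot2 is installed:

```r
# Invented totals for illustration
top <- data.frame(
  EVTYPE = c("FLOOD", "TORNADO", "HEAT"),
  human  = c(3, 18, 4),
  stringsAsFactors = FALSE
)

# reorder() sorts factor levels by increasing value, so negate for descending bars
top$EVTYPE <- reorder(top$EVTYPE, -top$human)
levels(top$EVTYPE)

if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  # geom_col() is equivalent to geom_bar(stat = "identity")
  p_sorted <- ggplot(top, aes(EVTYPE, human)) +
    geom_col(aes(fill = EVTYPE)) +
    labs(x = "Type of Event", y = "Impact (Sum of Injuries and Fatalities)")
}
```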
The first plot shows the ten types of severe weather events that have the biggest impact on human health according to the records from 1994 to 2011. Tornados are by far the weather event that causes the most injuries and fatalities, followed by excessive heat and floods.
The second plot displays the ten types of severe weather events that have the biggest economic impact. Here we see that flash floods cause the biggest property and crop damage, closely followed by tornados and thunderstorms. Looking at the event types in the plot, we can see that thunderstorm winds actually occur twice, under different names (TSTM WIND and THUNDERSTORM WIND). This is because we left the event types as they were defined in the raw dataset.
It could also make sense to group flash floods and floods together. With these alterations, floods would still be the event type causing the biggest damage, followed by thunderstorms and tornados.
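Such a grouping could be done by recoding EVTYPE before the data is summarized. A minimal sketch, assuming only the two merges mentioned above (the aliases vector and the sample evtype values are illustrative; the raw data may contain further spelling variants):

```r
# Sample event types, including the duplicated names discussed above
evtype <- c("TSTM WIND", "THUNDERSTORM WIND", "FLASH FLOOD", "FLOOD", "TORNADO")

# Map each alias to the name it should be merged into (assumed mapping)
aliases <- c("TSTM WIND" = "THUNDERSTORM WIND", "FLASH FLOOD" = "FLOOD")

# Replace aliases and leave all other event types unchanged
merged <- unname(ifelse(evtype %in% names(aliases), aliases[evtype], evtype))
unique(merged)
```

Applied to events$EVTYPE before the group_by() step, the same recoding would sum the impact of both names under one event type.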