Synopsis

The following report shows the results from the analysis of data from the U.S National Oceanic and Atmospheric´s (NOAA) storm database. Using data from 1950 to 2011, the report shows that in the U.S., the most harmful events are Tornadoes, Excessive Heat, Flash Floods, Heat, and Lightning. Together, these events account for more than 10 thousand fatalities. As we will see, this is close to 70 percent of the total fatalities reported during the period for all events together. In terms of average costs per event, Heavy Rain or Severe Weather events are the ones with the highest impact (cost). One of such events can cost, on average, up to 2500 million dollars. Tornadoes, Hurricanes, Storm Surges, and Severe Thunderstorms are the types of events that also show a high impact in terms of average cost per event.

Data Procesing

This analysis uses data from the U.S. National Oceanic and Atmospheric`s (NOAA) storm database. The database tracks major storms and weather events and their results (estimates) in terms of fatalities, injuries, and property damage. For detailed documentation please refer to the National Weater Service Storm Data Documentation. The data comes in a zip file and can be dowloaded from the course website.

# Download from Coursera´s website -last accessed on the date of this document-
url <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
download.file(url, destfile="stormData.csv.bz2", method="curl")

# Read data (read.table() works fine with compressed bzip2 files)
stormData <- read.table("stormData.csv.bz2", header=TRUE, sep=",")

After reading the data, we check for the number of rows and columns, and take a look at the classes assigned to those columns to make sure everything looks fine, and there is no need for additional processing.

library(dplyr)

# Rows & Cols
dim(stormData)
## [1] 902297     37
# Use glimpse() from the dplyr package which is in alternative to str()
glimpse(stormData)
## Observations: 902297
## Variables:
## $ STATE__    (dbl) 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ BGN_DATE   (fctr) 4/18/1950 0:00:00, 4/18/1950 0:00:00, 2/20/1951 0:...
## $ BGN_TIME   (fctr) 0130, 0145, 1600, 0900, 1500, 2000, 0100, 0900, 20...
## $ TIME_ZONE  (fctr) CST, CST, CST, CST, CST, CST, CST, CST, CST, CST, ...
## $ COUNTY     (dbl) 97, 3, 57, 89, 43, 77, 9, 123, 125, 57, 43, 9, 73, ...
## $ COUNTYNAME (fctr) MOBILE, BALDWIN, FAYETTE, MADISON, CULLMAN, LAUDER...
## $ STATE      (fctr) AL, AL, AL, AL, AL, AL, AL, AL, AL, AL, AL, AL, AL...
## $ EVTYPE     (fctr) TORNADO, TORNADO, TORNADO, TORNADO, TORNADO, TORNA...
## $ BGN_RANGE  (dbl) 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ BGN_AZI    (fctr) , , , , , , , , , , , , , , , , , , , , , , , , 
## $ BGN_LOCATI (fctr) , , , , , , , , , , , , , , , , , , , , , , , , 
## $ END_DATE   (fctr) , , , , , , , , , , , , , , , , , , , , , , , , 
## $ END_TIME   (fctr) , , , , , , , , , , , , , , , , , , , , , , , , 
## $ COUNTY_END (dbl) 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ COUNTYENDN (lgl) NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
## $ END_RANGE  (dbl) 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ END_AZI    (fctr) , , , , , , , , , , , , , , , , , , , , , , , , 
## $ END_LOCATI (fctr) , , , , , , , , , , , , , , , , , , , , , , , , 
## $ LENGTH     (dbl) 14.0, 2.0, 0.1, 0.0, 0.0, 1.5, 1.5, 0.0, 3.3, 2.3, ...
## $ WIDTH      (dbl) 100, 150, 123, 100, 150, 177, 33, 33, 100, 100, 400...
## $ F          (int) 3, 2, 2, 2, 2, 2, 2, 1, 3, 3, 1, 1, 3, 3, 3, 4, 1, ...
## $ MAG        (dbl) 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ FATALITIES (dbl) 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 4, 0, ...
## $ INJURIES   (dbl) 15, 0, 2, 2, 2, 6, 1, 0, 14, 0, 3, 3, 26, 12, 6, 50...
## $ PROPDMG    (dbl) 25.0, 2.5, 25.0, 2.5, 2.5, 2.5, 2.5, 2.5, 25.0, 25....
## $ PROPDMGEXP (fctr) K, K, K, K, K, K, K, K, K, K, M, M, K, K, K, K, K,...
## $ CROPDMG    (dbl) 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ CROPDMGEXP (fctr) , , , , , , , , , , , , , , , , , , , , , , , , 
## $ WFO        (fctr) , , , , , , , , , , , , , , , , , , , , , , , , 
## $ STATEOFFIC (fctr) , , , , , , , , , , , , , , , , , , , , , , , , 
## $ ZONENAMES  (fctr) , , , , , , , , , , , , , , , , , , , , , , , , 
## $ LATITUDE   (dbl) 3040, 3042, 3340, 3458, 3412, 3450, 3405, 3255, 333...
## $ LONGITUDE  (dbl) 8812, 8755, 8742, 8626, 8642, 8748, 8631, 8558, 874...
## $ LATITUDE_E (dbl) 3051, 0, 0, 0, 0, 0, 0, 0, 3336, 3337, 3402, 3404, ...
## $ LONGITUDE_ (dbl) 8806, 0, 0, 0, 0, 0, 0, 0, 8738, 8737, 8644, 8640, ...
## $ REMARKS    (fctr) , , , , , , , , , , , , , , , , , , , , , , , , 
## $ REFNUM     (dbl) 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, ...

There are 902297 observations (events in the U.S. from 1950 to 2011). We now proceed with some additional processing in order to obtain a suitable analytical file. The expense data is divided in two columns. One of them has the amount, and the second has the unit (e.g. “K” for thousands, “M” for millions, and so on). Therefore, we need to make some additional processing prior to reporting the most expensive types of events across all U.S. First, we select only those columns of interest: the event type, the number of fatalities, and the composite expense columns for properties and crops. Second, we divide the dataset in two subsets, one with all rows and a second one that filters only rows that contain events with expense reports that have the units stated in the documentation (“K”, “M”, and “B”). Third, in that latter file we create two new columns with the corresponding multipliers. Rows with values other than the ones reported in the documentation, and the “?” sign for missing values are assumed to be equal to 1. These values are not significant in terms of the total number of records or the value (expenses) they represent. Therefore, we only focus on those expenses reported with the appropriate units, as documented in the website. In the next chunk of code, we obtain two datasets, one with all rows (stormData) which also includes the FATALITIES column, and a second one called stormDtaTemp which is a subset of the first one and only include rows for which the expense units are the ones mentioned in the documentation file (see link).

# Select only columns of interest
stormData <- select(stormData, EVTYPE, FATALITIES, PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP)

# Filter only expense data that contains units in "K", "M", or "B"
units <- c("k", "K", "m", "M", "b", "B")
stormDtaTemp <- filter(stormData, PROPDMGEXP %in% units | CROPDMGEXP %in% units)

# Transform units columns to multipliers
stormDtaTemp$PROPDMGEXP <- as.factor(as.character(stormDtaTemp$PROPDMGEXP))
stormDtaTemp$CROPDMGEXP <- as.factor(as.character(stormDtaTemp$CROPDMGEXP))
stormDtaTemp$multPropExp <- 1
stormDtaTemp$multPropExp[stormDtaTemp$PROPDMGEXP=="?"] <- NA
stormDtaTemp$multPropExp[stormDtaTemp$PROPDMGEXP=="k"] <- 1000
stormDtaTemp$multPropExp[stormDtaTemp$PROPDMGEXP=="K"] <- 1000
stormDtaTemp$multPropExp[stormDtaTemp$PROPDMGEXP=="m"] <- 1000000
stormDtaTemp$multPropExp[stormDtaTemp$PROPDMGEXP=="M"] <- 1000000
stormDtaTemp$multPropExp[stormDtaTemp$PROPDMGEXP=="b"] <- 1000000000
stormDtaTemp$multPropExp[stormDtaTemp$PROPDMGEXP=="B"] <- 1000000000
stormDtaTemp$multCropExp <- 1
stormDtaTemp$multCropExp[stormDtaTemp$CROPDMGEXP=="?"] <- NA
stormDtaTemp$multCropExp[stormDtaTemp$CROPDMGEXP=="k"] <- 1000
stormDtaTemp$multCropExp[stormDtaTemp$CROPDMGEXP=="K"] <- 1000
stormDtaTemp$multCropExp[stormDtaTemp$CROPDMGEXP=="m"] <- 1000000
stormDtaTemp$multCropExp[stormDtaTemp$CROPDMGEXP=="M"] <- 1000000
stormDtaTemp$multCropExp[stormDtaTemp$CROPDMGEXP=="b"] <- 1000000000
stormDtaTemp$multCropExp[stormDtaTemp$CROPDMGEXP=="B"] <- 1000000000

# Create new columns for economic results of each event
stormDtaTemp <- mutate(stormDtaTemp, property=PROPDMG*multPropExp,
                       crops=CROPDMG*multCropExp, totalExpense=property+crops)

Results

Across the United States, which types of events are most harmful with respect to population health?

A quick look at the levels of the column of interest shows us that there are more than 980 factors in this column. We now use the verbs in the package dplyr in order to create a temporary table with the cumulative fatalities for each type of event during the period from 1950 to 2011. Notice that we are removing all NA from the analysis. We only report the top 20 events. All other events are aggregated into a single “OTHER” class.

# Aggregate fatalities by type of event
temp1 <- summarize(group_by(stormData, EVTYPE), totalFatalities=sum(FATALITIES,na.rm=T))

# Order temp. table and filter only top 20
temp1 <- arrange(temp1, desc(totalFatalities))
temp2 <- temp1[1:20,]; temp3 <- temp1[21:dim(temp1)[1],]
temp3$EVTYPE <- "OTHER"
temp3 <- summarize(group_by(temp3,EVTYPE), totalFatalities=sum(totalFatalities))
temp4 <- rbind(temp2,temp3)
library(ggplot2)
p1 <- ggplot(temp4, aes(x=reorder(EVTYPE,totalFatalities), y=totalFatalities))
p1 + geom_bar(stat='identity') + coord_flip() +
        labs(title="Total Fatalities by Type of Event 1950-2011\nStorms & Weather Events in the US") +
        labs(y="Total Fatalities", x="Type of Event") +
        theme_bw(base_family="Times", base_size=12)

The top 5 events in terms of fatalities (during the period 1950-2011 -see previous graph-) are Tornadoes, Excessive Heat, Flash Floods, Heat, and Lightning. Together, these events account for more than 10 thousand fatalities. These five types of events represent 67.79 percent of the total fatalities observed during the period in all types of events.

Across the United States, which types of events have the greatest economic consequences?

Using the processed and filtered data (stormDtaTemp file), we can now look at the total costs associated with each type of event.

# Aggregate expenses by type of event
temp1 <- summarize(group_by(stormDtaTemp, EVTYPE), totalExpense=sum(totalExpense,na.rm=T))
temp1 <- mutate(temp1, totalExpense2=totalExpense/1000000)

# Order temp. table and filter only top 20 (total Expense)
temp1 <- arrange(temp1, desc(totalExpense2))
temp2 <- temp1[1:20,]; temp3 <- temp1[21:dim(temp1)[1],]
temp3$EVTYPE <- "OTHER"
temp3 <- summarize(group_by(temp3,EVTYPE), totalExpense=sum(totalExpense),
                   totalExpense2=sum(totalExpense2))
temp4 <- rbind(temp2,temp3)
p1 <- ggplot(temp4, aes(x=reorder(EVTYPE,totalExpense2), y=totalExpense2))
p1 + geom_bar(stat='identity') + coord_flip() +
        labs(title="Total Economic Costs by Type of Event 1950-2011\nStorms & Weather Events in the US") +
        labs(y="Total Expenses (Millions of USD)", x="Type of Event") +
        theme_bw(base_family="Times", base_size=12)

We can observe a large difference between the top type of event (total cost) and the rest of the events. This might be due to the presence of a type of event that is recurrent (floods in this case) or a single event with a very high-impact. In order to control for such possibilities we now look at the average total cost for each type of event. In the following graph, we will analyze the average cost for event in any type of event.

# Summarize (average) expenses by type of event
temp1 <- summarize(group_by(stormDtaTemp, EVTYPE), avgExpense=mean(totalExpense,na.rm=T))
temp1 <- mutate(temp1, avgExpense2=avgExpense/1000000)

# Order temp. table
temp1 <- arrange(temp1, desc(avgExpense2))
p1 <- ggplot(temp1[1:20,], aes(x=reorder(EVTYPE,avgExpense2), y=avgExpense2))
p1 + geom_bar(stat='identity') + coord_flip() +
        labs(title="Average Economic Costs by Event by Type of Event 1950-2011\nStorms & Weather Events in the US") +
        labs(y="Average Expenses (Millions of USD)", x="Type of Event") +
        theme_bw(base_family="Times", base_size=12)

On average, Heavy Rain or Severe Weather events are the ones with the highest impact (cost). One of such events can cost, on average, up to 2500 million dollars. Tornadoes, Hurricanes, Storm Surges, and Severe Thunderstorms are the types of events that also show a high impact in terms of average cost per event. However, other than the top three events (in terms of average costs), all events show an average cost less than 500 million dollars.