Synopsis

The basic goal of this report is to explore the NOAA Storm Database and answer the following questions about severe weather events:

  1. Which types of events are most harmful with respect to population health and where are they located in the US?
  2. Which types of events have the greatest economic consequences and where are they located in the US?

The report will show that tornados are most harmful to population health in the mainland US between 1993 and 2011. They cause 3 times more casualties than any other severe weather events. Floods have the greatest economic consequences. They are responsible for more than 150 Billion USD in property and crop damages since 1993. Texas, California and the southwest of the US are particularly exposed to severe weather events.

Data Processing

Loading, Subsetting & Date Conversion

We first load the data into R.

## Download the file.     
url <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
download.file(url, "repdata-data-StormData.csv", method="curl")

## Reading and Subsetting the data
dt <- read.csv("./repdata-data-StormData.csv", sep=",", header=T, na.strings = "", stringsAsFactors=FALSE)
dataProcessing <- subset(dt, select=c( "STATE", "EVTYPE", "FATALITIES","INJURIES","PROPDMG", 
                             "PROPDMGEXP", "CROPDMG", "CROPDMGEXP", "BGN_DATE" ))
## Date and US State Formatting
library(lubridate)
dataProcessing$DATE <- as.Date(dataProcessing$BGN_DATE, format = "%m/%d/%Y" )
dataProcessing$STATENAME <- state.name[match(dataProcessing$STATE,state.abb)]
dataProcessing$YEAR <- substr(dataProcessing$DATE,1,4)
dataProcessing$YEAR <- as.numeric(dataProcessing$YEAR)
# 902297 obs. of  12 variables

Data Quality

In this section, we look at the total number of missing values and other potential sources of error. We then chose a strategy for imputing missing value and improving the data quality.

## Calculate the total number of missing value per column.
sapply(dataProcessing, function(x) sum(is.na(x)))
##      STATE     EVTYPE FATALITIES   INJURIES    PROPDMG PROPDMGEXP 
##          0          0          0          0          0     465934 
##    CROPDMG CROPDMGEXP   BGN_DATE       DATE  STATENAME       YEAR 
##          0     618413          0          0      19111          0
## We run a simple frequency analysis to look at the "damage" format and state name.
# sort(table(dataProcessing$STATENAME),decreasing=T)
sort(table(dataProcessing$PROPDMGEXP),decreasing=T)
## 
##      K      M      0      B      5      1      2      ?      m      H 
## 424665  11330    216     40     28     25     13      8      7      6 
##      +      7      3      4      6      -      8      h 
##      5      5      4      4      4      1      1      1
sort(table(dataProcessing$CROPDMGEXP),decreasing=T)
## 
##      K      M      k      0      B      ?      2      m 
## 281832   1994     21     19      9      7      1      1

The data is quite “dirty”. We choose the following strategy to improve the data quality:

  1. There is a lot of missing values in column PROPDMGEXP and CROPDMGEXP. We chose to replace it with a value of 1 not to impact on the human and economic damage measurements.
  2. After the transformation from state abbreviations to state names, there is a lot of state names missing due to regions outside the mainland US like Federal areas, Pacific islands, obsolete postal code and US military base. These “regions” are removed from the dataset. Our analysis will only look at US states in Northern America (including Hawaii).
  3. The data quality of PROPDMGEXP and CROPDMGEXP is poor and difficult to analyse. We will need to transform these variables from character to numeric values (see data transformation for details).
  4. We assume that more recent years should be considered more complete, hence it may be valuable to limit the dataset to include only recent years. We will evaluate the reliability of the dates in the last part of the pre-processing steps.
## Missing values
dataProcessing$CROPDMGSUM = 1
dataProcessing[is.na(dataProcessing$CROPDMGEXP), ]$CROPDMGEXP <- dataProcessing[is.na(dataProcessing$CROPDMGEXP), ]$CROPDMGSUM
dataProcessing$PROPDMGSUM = 1
dataProcessing[is.na(dataProcessing$PROPDMGEXP), ]$PROPDMGEXP <- dataProcessing[is.na(dataProcessing$PROPDMGEXP), ]$PROPDMGSUM
## Remove regions outside Northern America and Federal States
dataProcessing2 <- dataProcessing[!is.na(dataProcessing$STATENAME), ]
# 883186 obs. of  14 variables

Data Transformation

We transform and create new variables needed for analysing the damage due to severe weather storms.

transform <- function(x) {
  if (x=="H"|x=="h") y <- 100
  else if (x=="K"|x=="k") y <- 1000
  else if (x=="M"|x=="m") y <- 1000000
  else if (x=="B"|x=="b") y <- 1000000000
  else y <- 1
  y
  }
dataProcessing2$PROPDMGEXP <- sapply(dataProcessing2$PROPDMGEXP,transform)
dataProcessing2$CROPDMGEXP <- sapply(dataProcessing2$CROPDMGEXP,transform)

## Create Property and Crop Damage Variables for evaluating the economic impact.
dataProcessing2$DamageProperty <- dataProcessing2$PROPDMGEXP * dataProcessing2$PROPDMG
dataProcessing2$DamageCrops <- dataProcessing2$CROPDMGEXP * dataProcessing2$CROPDMG
dataProcessing2$TotalDamage <- dataProcessing2$DamageProperty + dataProcessing2$DamageCrops

## Create Casualty Variable for evaluating the population health impact.
dataProcessing2$Casualties <- dataProcessing2$FATALITIES + dataProcessing2$INJURIES

Creating the Analytical Dataset

We finally quantify the amount of analytical data needed for further data processing.

## Subsetting and quality assessment of the date (assuming that more recent years should be considered more complete.)
## dataProcessing3 <- dataProcessing2[dataProcessing2$FATALITIES > 0, ]
## summary(dataProcessing3$DATE)
dataProcessing4 <- dataProcessing2[dataProcessing2$DamageCrops > 0, ]
summary(dataProcessing4$DATE)
##         Min.      1st Qu.       Median         Mean      3rd Qu. 
## "1993-01-06" "1997-08-14" "2001-09-20" "2002-05-08" "2007-07-03" 
##         Max. 
## "2011-11-28"

Conclusion: Information about crop damages is not available before 1993. There is also a similar issue for other severe weather events. Before 1993, the NOAA Storm Database only contains casualties history data for tornados (1950), hails (1983) and thunderstorm winds (1983). Considering the issue of comparability of data sources, we chose to only analyse this dataset from June 1993 to November 2011. All data prior to 01-06-1993 will be removed.

## Creating the analytical dataset
analyticalData <- dataProcessing2[dataProcessing2$DATE >= "1993-01-06", ]
# 695571 obs. of  18 variables

Finally, we create a new dataset for analysing the data. We beleive this dataset contains a higher data quality for further analysis and reporting.

Results

Most Harmful Events Across the US from 1993 to 2011

First, we look at which types of events are most harmful with respect to population health.

## Aggregate casualties by most harmful events
aggAnalytical <- with(analyticalData, aggregate(analyticalData$Casualties, by=list(analyticalData$EVTYPE), sum))
names(aggAnalytical)[names(aggAnalytical) == 'Group.1'] <- 'Harmful_Events'
names(aggAnalytical)[names(aggAnalytical) == 'x'] <- 'Nb_Casualties'

## Top 10 most harmful events (in terms of casualties)
agg <- aggAnalytical[order(-aggAnalytical$Nb_Casualties), ]
top <- agg[1:10, ]
top
##        Harmful_Events Nb_Casualties
## 811           TORNADO         24931
## 129    EXCESSIVE HEAT          8092
## 167             FLOOD          7250
## 451         LIGHTNING          6018
## 833         TSTM WIND          3855
## 272              HEAT          3035
## 152       FLASH FLOOD          2703
## 415         ICE STORM          2064
## 737 THUNDERSTORM WIND          1604
## 939      WINTER STORM          1526

Conclusion: Tornados are by far the most harmful events in terms of casualties, followed by excessive heat, floods, lightning and thunderstorm winds.

Secondly, we look at which states in the US had the most casualties between 1993 and 2011.

## Aggregate casualties by states
aggAnalytical2 <- with(analyticalData, aggregate(analyticalData$Casualties, by=list(analyticalData$STATENAME), sum))
names(aggAnalytical2)[names(aggAnalytical2) == 'Group.1'] <- 'States'
names(aggAnalytical2)[names(aggAnalytical2) == 'x'] <- 'Nb_Casualties'

## Top 10 States most at risk from 1993-2011 (casualties)
agg2 <- aggAnalytical2[order(-aggAnalytical2$Nb_Casualties), ]
agg2[1:10, ]
##          States Nb_Casualties
## 43        Texas         10998
## 25     Missouri          7309
## 1       Alabama          4825
## 9       Florida          4043
## 5    California          3738
## 35         Ohio          3056
## 42    Tennessee          3032
## 13     Illinois          2958
## 38 Pennsylvania          2832
## 10      Georgia          2825

Conclusion: Texas is by far the state with the highest number of casualties between 1993-2011, followed by Missouri, Alabama, Florida and California.

## Create a US map to show the number of casualties per state.
library(maps)
library(ggplot2)
library(grDevices)

## First read the data and add a column with the names of the states:
aggAnalytical3 <- with(analyticalData, aggregate(analyticalData$Casualties, by=list(analyticalData$STATE, analyticalData$STATENAME), sum))

names(aggAnalytical3)[names(aggAnalytical3) == 'Group.1'] <- 'state'
names(aggAnalytical3)[names(aggAnalytical3) == 'x'] <- 'Casualties'
names(aggAnalytical3)[names(aggAnalytical3) == 'Group.2'] <- 'region'
agg3 <- aggAnalytical3[order(aggAnalytical3$state), ]
agg3$state <- as.factor(agg3$state)
agg3$region <- tolower(as.factor(agg3$region))

## Create a map            
states <- map_data("state")
agg3merged <- merge(states, agg3, by = "region")
agg3merged <- agg3merged[order(agg3merged$order), ]

pal <- colorRampPalette(c("white", "red"))
breaks <- c(quantile(agg3merged$Casualties))

q <- ggplot(data = agg3merged, aes(x = long, y = lat, group = group))
q <- q + geom_polygon(color = "black", aes(fill = factor(Casualties)))
q <- q + ggtitle("Accumulated Number of Casualties between 1993 and 2011
                 due to Severe Weather Events")
q <- q + scale_fill_manual( values = pal(48), breaks= breaks ) 
q <- q + labs(fill="Number of\nCasualties")

Fig 1.1 US States showing the total number of casualties due to severe weather events between 1993 and 2011

q

Events with the Greatest Economic Consequences between 1993-2011

We first look at which types of events have the greatest economic consequences.

## Aggregate Total Damage by most harmful events
aggAnalytical4 <- with(analyticalData, aggregate(analyticalData$TotalDamage, by=list(analyticalData$EVTYPE), sum))
names(aggAnalytical4)[names(aggAnalytical4) == 'Group.1'] <- 'Harmful_Events'
names(aggAnalytical4)[names(aggAnalytical4) == 'x'] <- 'Total_Damage'

## Top 10 most harmful events (economic consequences)
agg4 <- aggAnalytical4[order(-aggAnalytical4$Total_Damage), ]
## USD Currency
library(scales)
agg4$Total_Damage <- dollar(agg4$Total_Damage)
agg4[1:10, ]
##        Harmful_Events     Total_Damage
## 167             FLOOD $150,137,226,757
## 399 HURRICANE/TYPHOON  $71,636,600,800
## 647       STORM SURGE  $43,323,466,000
## 811           TORNADO  $26,753,187,479
## 241              HAIL  $18,758,215,016
## 152       FLASH FLOOD  $17,266,310,517
## 94            DROUGHT  $15,013,467,000
## 390         HURRICANE  $12,103,928,010
## 571       RIVER FLOOD  $10,148,401,000
## 415         ICE STORM   $8,966,471,360
##Barplot Foundation for Summary Section
agg4A <- aggAnalytical4[order(-aggAnalytical4$Total_Damage), ]
top2 <- agg4A[1:10, ]

Conclusion: Floods are by far the event that has the greatest economic consequence in the US, followed by hurricanes/typhoons, storm surges, tornados and hail.

Secondly, we look at which states in the US had the greatest economic costs.

## Aggregate Total Damage by States
aggAnalytical5 <- with(analyticalData, aggregate(analyticalData$TotalDamage, by=list(analyticalData$STATENAME), sum))
names(aggAnalytical5)[names(aggAnalytical5) == 'Group.1'] <- 'States'
names(aggAnalytical5)[names(aggAnalytical5) == 'x'] <- 'Total_Damage'

## Top 10 States most at risk from 1993-2011 (economic consequences)
agg5 <- aggAnalytical5[order(-aggAnalytical5$Total_Damage), ]
agg5$Total_Damage <- dollar(agg5$Total_Damage)
agg5[1:10, ]
##            States     Total_Damage
## 5      California $127,029,596,053
## 18      Louisiana  $60,396,916,500
## 9         Florida  $44,832,174,923
## 24    Mississippi  $35,347,581,002
## 43          Texas  $31,280,162,407
## 1         Alabama  $16,592,394,290
## 13       Illinois  $12,750,030,152
## 33 North Carolina   $9,348,983,251
## 15           Iowa   $8,212,999,790
## 25       Missouri   $6,812,246,768

Conclusion: California is, by far, the US State that suffered the heaviest economic consequences of severe weather events, followed by Louisiana, Florida, Mississipi and Texas.

## Create a US map to show the number of casualties per state.
library(maps)
library(ggplot2)
library(grDevices)
## First read the data and add a column with the names of the states:
aggAnalytical6 <- with(analyticalData, aggregate(analyticalData$TotalDamage, by=list(analyticalData$STATE, analyticalData$STATENAME), sum))

names(aggAnalytical6)[names(aggAnalytical6) == 'Group.1'] <- 'state'
names(aggAnalytical6)[names(aggAnalytical6) == 'x'] <- 'Total_Damage'
names(aggAnalytical6)[names(aggAnalytical6) == 'Group.2'] <- 'region'
agg6 <- aggAnalytical6[order(aggAnalytical6$state), ]
agg6$state <- as.factor(agg6$state)
agg6$region <- tolower(as.factor(agg6$region))

## Create a map            
states2 <- map_data("state")
agg6merged <- merge(states2, agg6, by = "region")
agg6merged <- agg6merged[order(agg6merged$order), ]

pal2 <- colorRampPalette(c("white", "green"))
breaks2 <- c(quantile(agg6merged$Total_Damage))

q <- ggplot(data = agg6merged, aes(x = long, y = lat, group = group))
q <- q + geom_polygon(color = "black", aes(fill = factor(Total_Damage)))
q <- q + ggtitle("Property and Crop Damage due to Severe Weather Events
                 between 1993 and 2011 (in USD)")
q <- q + scale_fill_manual( values = pal2(48), breaks= breaks2 ) 
q <- q + labs(fill="Total Damage in USD")

Fig 2.1: US States showing the total damage (Property and Crop) in USD due to severe weather events between 1993 and 2011

q

Summary of the Analysis and Possibilities for Further Research

Summary

par(mfrow = c(2, 1), mar=c(0.5,5,0.75,1)+1.4)
# plot 1 (graf 1.1)
barplot(top$Nb_Casualties, names.arg=top$Harmful_Events, horiz=TRUE, cex.names=0.5, las=1, ylim=rev(c(0,14)),
        beside=TRUE,axes=TRUE,col="red", density=NA,legend.text=TRUE,xlim=c(0,30000), space=0.4, cex=0.5,
        xlab="", ylab="", main="Top 10 Most Harmful Events - Casualties")

# plot 2 (graf 2.1)
barplot(top2$Total_Damage, names.arg=top2$Harmful_Events, horiz=TRUE, cex.names=0.5, las=1, ylim=rev(c(0,14)),
        beside=TRUE,axes=TRUE,col="green", density=NA,legend.text=TRUE,xlim=c(0,200000000000), space=0.4, 
        cex=0.5, xlab="", ylab="", main="Top 10 Most Harmful Events - Total Damage USD")

  1. The types of events that are most harmful with respect to population health are: tornados, excessive heat, floods, lightning and thunderstorm winds. Tornados cause 3 times more casualties than any other severe weather events in the US. They were responsible for almost 25.000 fatalities or injuries between 1993 and 2011.
  2. The most harmful events are located primarily in Texas, Missouri, Alabama, Florida and California. Almost 11.000 casualties were reported in Texas alone (40%) during 1993-2011.
  3. The types of events that have the greatest economic consequences are: floods, hurricanes/typhoons, storm surges, tornados and hail. Floods are responsible for more than 150 Billion USD in property and crop damages since 1993. This is more than twice the total damage caused by any other severe weather events. In second and fourth position, hurricanes, typhoons and tornados caused damages for more than 98 Billion USD since 1993.
  4. The types of events with greatest economic consequences are primarily located in California, Louisiana, Florida, Mississippi and Texas. The southwest of the mainland US is particularly exposed to severe weather events.

Further Research

  1. Tornado Trends and increased planning support? Further investigation of casualties due to tornados could indicate a decline/negative trend with respect to casualties after 1980. In comparison, there seem to be an increase in casualties due to tornados between 1950 and 1980. A possible t-test could help investigate this further if this is of interest.
  2. We could not find indications to support the hypothesis of an increase in severe weather events in terms of casualties. However, correlation between NOAA Storm Database and indications of global warming should be of interest for further research.