Synopsis

The objective of this report is to explore the NOAA (U.S. National Oceanic & Atmospheric Administration) Storm Database and determine which severe weather events have the most harmful impact on population health and the economy. In particular, we are interested in the estimated counts of fatalities and injuries, and in the size of the property and crop damage linked with specific weather events. To answer these questions we’ll use the Storm Data database (47MB) from NOAA and provide the analysis code that produces the tables, figures and summaries required.

This report is part of the Reproducible Research course in the Johns Hopkins University Data Science Specialization track at Coursera. It is provided as a markdown document and published to RPubs, where it is available for peer assessment.


Data Processing

The dataset for this assignment is downloaded from the National Weather Service database, which tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage. The dataset comes as a comma-separated-value file compressed with the bzip2 algorithm to reduce its size.

Loading the data

First, let’s load the libraries we’ll use for the analysis.

library(dplyr)
library(reshape2)
library(ggplot2)
library(gridExtra) 
## Loading required package: grid

Next, we’ll download the data from the link provided:

## Create data dir
if(!file.exists("Data")) {
    dir.create("Data")
}
URL <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"

## download file if not already downloaded
if (!file.exists("./Data/repdata-data-StormData.csv.bz2")) {
    download.file(URL, destfile = "./Data/repdata-data-StormData.csv.bz2")
    dateDownloaded <- date()
}

Check to see if data has successfully downloaded:

list.files("./Data")
## [1] "repdata-data-StormData.csv.bz2"

We’ll load the dataset next and check its size:

## load dataset if not already loaded
if (!"raw_data" %in% ls()) {
    raw_data <- read.csv(bzfile("./Data/repdata-data-StormData.csv.bz2"))
}

## vars for reporting dataset size
rownum <- dim(raw_data)[1]
colnum <- dim(raw_data)[2]

The dataset consists of 37 variables and 902297 observations.
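Only a handful of the 37 columns are used later, so the full read is not strictly necessary. As a hedged sketch (not part of the original analysis), the snippet below shows one way to drop unwanted columns at read time by passing "NULL" entries in `colClasses`. The toy file and the names `toy_path` and `keep` are assumptions for illustration; for the real data the path would be the .bz2 file downloaded above.

```r
# Sketch: read only selected columns from a bzip2-compressed CSV.
# The toy data and the names toy_path / keep are assumptions for illustration.
toy <- data.frame(EVTYPE = c("TORNADO", "HAIL"),
                  FATALITIES = c(1, 0),
                  REMARKS = c("long text", "more text"),
                  REFNUM = 1:2)
toy_path <- tempfile(fileext = ".csv.bz2")
con <- bzfile(toy_path, "w")
write.csv(toy, con, row.names = FALSE)
close(con)

keep <- c("EVTYPE", "FATALITIES")

# Read just the first row to learn the column names
header <- names(read.csv(bzfile(toy_path), nrows = 1))

# NA lets read.csv guess the class; "NULL" drops the column entirely
col_classes <- ifelse(header %in% keep, NA, "NULL")
small <- read.csv(bzfile(toy_path), colClasses = col_classes)
names(small)
```

Skipping wide text columns such as REMARKS this way should reduce memory use, although the file still has to be decompressed and scanned in full.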

The variables included in the dataset are:

names(raw_data)
##  [1] "STATE__"    "BGN_DATE"   "BGN_TIME"   "TIME_ZONE"  "COUNTY"    
##  [6] "COUNTYNAME" "STATE"      "EVTYPE"     "BGN_RANGE"  "BGN_AZI"   
## [11] "BGN_LOCATI" "END_DATE"   "END_TIME"   "COUNTY_END" "COUNTYENDN"
## [16] "END_RANGE"  "END_AZI"    "END_LOCATI" "LENGTH"     "WIDTH"     
## [21] "F"          "MAG"        "FATALITIES" "INJURIES"   "PROPDMG"   
## [26] "PROPDMGEXP" "CROPDMG"    "CROPDMGEXP" "WFO"        "STATEOFFIC"
## [31] "ZONENAMES"  "LATITUDE"   "LONGITUDE"  "LATITUDE_E" "LONGITUDE_"
## [36] "REMARKS"    "REFNUM"

Documentation for the database is also available, describing how some of the variables are constructed and defined.

The events in the database start in the year 1950 and end in November 2011. In the earlier years of the database there are generally fewer events recorded, most likely due to a lack of good records. More recent years should be considered more complete, so we’ll use only the more recent data to get more consistent results.

Here’s a sample of raw data:

data <- tbl_df(raw_data)
print(data)
## Source: local data frame [902,297 x 37]
## 
##    STATE__           BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME STATE
## 1        1  4/18/1950 0:00:00     0130       CST     97     MOBILE    AL
## 2        1  4/18/1950 0:00:00     0145       CST      3    BALDWIN    AL
## 3        1  2/20/1951 0:00:00     1600       CST     57    FAYETTE    AL
## 4        1   6/8/1951 0:00:00     0900       CST     89    MADISON    AL
## 5        1 11/15/1951 0:00:00     1500       CST     43    CULLMAN    AL
## 6        1 11/15/1951 0:00:00     2000       CST     77 LAUDERDALE    AL
## 7        1 11/16/1951 0:00:00     0100       CST      9     BLOUNT    AL
## 8        1  1/22/1952 0:00:00     0900       CST    123 TALLAPOOSA    AL
## 9        1  2/13/1952 0:00:00     2000       CST    125 TUSCALOOSA    AL
## 10       1  2/13/1952 0:00:00     2000       CST     57    FAYETTE    AL
## ..     ...                ...      ...       ...    ...        ...   ...
## Variables not shown: EVTYPE (fctr), BGN_RANGE (dbl), BGN_AZI (fctr),
##   BGN_LOCATI (fctr), END_DATE (fctr), END_TIME (fctr), COUNTY_END (dbl),
##   COUNTYENDN (lgl), END_RANGE (dbl), END_AZI (fctr), END_LOCATI (fctr),
##   LENGTH (dbl), WIDTH (dbl), F (int), MAG (dbl), FATALITIES (dbl),
##   INJURIES (dbl), PROPDMG (dbl), PROPDMGEXP (fctr), CROPDMG (dbl),
##   CROPDMGEXP (fctr), WFO (fctr), STATEOFFIC (fctr), ZONENAMES (fctr),
##   LATITUDE (dbl), LONGITUDE (dbl), LATITUDE_E (dbl), LONGITUDE_ (dbl),
##   REMARKS (fctr), REFNUM (dbl)

Omitting the lower quality data from earlier years

We’ll use only the data from the years in which the number of recorded events is larger than 10,000.

data$Year <- as.numeric(format(as.Date(data$BGN_DATE, format = "%m/%d/%Y %H:%M:%S"), "%Y"))
by_years <- group_by(data, Year)
by_years <- summarize(by_years, Event_count = n())
print(by_years)
storm_data <- filter(data, Year >= 1989)
## Source: local data frame [62 x 2]
## 
##    Year Event_count
## 1  1950         223
## 2  1951         269
## 3  1952         272
## 4  1953         492
## 5  1954         609
## 6  1955        1413
## 7  1956        1703
## 8  1957        2184
## 9  1958        2213
## 10 1959        1813
## 11 1960        1945
## 12 1961        2246
## 13 1962        2389
## 14 1963        1968
## 15 1964        2348
## 16 1965        2855
## 17 1966        2388
## 18 1967        2688
## 19 1968        3312
## 20 1969        2926
## 21 1970        3215
## 22 1971        3471
## 23 1972        2168
## 24 1973        4463
## 25 1974        5386
## 26 1975        4975
## 27 1976        3768
## 28 1977        3728
## 29 1978        3657
## 30 1979        4279
## 31 1980        6146
## 32 1981        4517
## 33 1982        7132
## 34 1983        8322
## 35 1984        7335
## 36 1985        7979
## 37 1986        8726
## 38 1987        7367
## 39 1988        7257
## 40 1989       10410
## 41 1990       10946
## 42 1991       12522
## 43 1992       13534
## 44 1993       12607
## 45 1994       20631
## 46 1995       27970
## 47 1996       32270
## 48 1997       28680
## 49 1998       38128
## 50 1999       31289
## 51 2000       34471
## 52 2001       34962
## 53 2002       36293
## 54 2003       39752
## 55 2004       39363
## 56 2005       39184
## 57 2006       44034
## 58 2007       43289
## 59 2008       55663
## 60 2009       45817
## 61 2010       48161
## 62 2011       62174

From this summary we find that every year from 1989 to 2011 has more than 10,000 recorded events, so we’ll use only the data from that period for our analysis.
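The cutoff year is hardcoded in the filter above. As a minimal sketch, it could instead be derived from the yearly counts. The data frame `by_years_toy` and the variable names below are illustrative assumptions; the counts are copied from the summary for 1985-1993. The sketch also assumes at least one year falls at or below the threshold.

```r
# Sketch: find the first year from which every later year exceeds a threshold.
# Counts below are copied from the yearly summary for 1985-1993.
by_years_toy <- data.frame(
  Year = 1985:1993,
  Event_count = c(7979, 8726, 7367, 7257, 10410, 10946, 12522, 13534, 12607)
)
threshold <- 10000

# Years that do not exceed the threshold
short_years <- by_years_toy$Year[by_years_toy$Event_count <= threshold]

# Start right after the last such year (assumes short_years is not empty)
start_year <- max(short_years) + 1
print(start_year)
```

For these counts `start_year` is 1989, matching the value used in the filter.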

The main variables of our interest here are:

  • EVTYPE - Event Type
  • MAG - Magnitude
  • FATALITIES - Number of fatalities
  • INJURIES - Number of persons injured
  • PROPDMG - Property damage [USD]
  • PROPDMGEXP - Property damage exponent (thousands, millions, billions…)
  • CROPDMG - Crops damage [USD]
  • CROPDMGEXP - Crops damage exponent (thousands, millions, billions…)

We’ll clean up the data and produce the subset data table with only those variables we are interested in.

storm_data <- subset(storm_data, select=c(8, 23:28))
names(storm_data) <- c("Event_Type", "Fatalities", "Injuries", "Property_Damage", "Property_Damage_Exp", "Crop_Damage", "Crop_Damage_Exp")
str(storm_data)
## Classes 'tbl_df' and 'data.frame':   762150 obs. of  7 variables:
##  $ Event_Type         : Factor w/ 985 levels "   HIGH SURF ADVISORY",..: 834 834 856 834 856 856 856 244 856 856 ...
##  $ Fatalities         : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Injuries           : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Property_Damage    : num  2.5 250 0 2.5 0 0 0 0 0 0 ...
##  $ Property_Damage_Exp: Factor w/ 19 levels "","-","?","+",..: 19 17 1 19 1 1 1 1 1 1 ...
##  $ Crop_Damage        : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Crop_Damage_Exp    : Factor w/ 9 levels "","?","0","2",..: 1 1 1 1 1 1 1 1 1 1 ...

The PROPDMGEXP and CROPDMGEXP variables encode the power of ten by which the corresponding PROPDMG and CROPDMG values must be multiplied. The total property and crop damage is therefore calculated like this:

  • Property_Damage = PROPDMG * 10 ^ PROPDMGEXP
  • Crop_Damage = CROPDMG * 10 ^ CROPDMGEXP
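As a quick worked check of the formula, the first two records in the str() output above have PROPDMG values of 2.5 and 250 with exponent codes "M" and "K". The variable names below are illustrative only.

```r
# Worked example of damage = value * 10 ^ exponent,
# using the first two records shown in the str() output above
value    <- c(2.5, 250)   # PROPDMG
exponent <- c(6, 3)       # "M" -> 6 (millions), "K" -> 3 (thousands)
damage_usd <- value * 10^exponent
damage_usd
```

This gives 2,500,000 and 250,000 dollars respectively.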

Now we’ll use those 4 original columns to create one column for total property damage and one for total crop damage.

# Check the levels for exponent variables
unique(storm_data$Property_Damage_Exp)
unique(storm_data$Crop_Damage_Exp)
##  [1] M K   B m + 0 5 6 ? 4 2 3 h 7 H - 1 8
## Levels:  - ? + 0 1 2 3 4 5 6 7 8 B h H K m M
## [1]   M K m B ? 0 k 2
## Levels:  ? 0 2 B k K m M

The level values for the property and crop damage exponents are not uniform: they appear as digits, letters (h, H, k, K, m, M, B), special characters (-, +, ?), or are simply missing. We’ll convert these levels to integer exponents so we can compute a single dollar value for property damage and one for crop damage. Special characters and missing values are replaced with an exponent of zero.

# Map each exponent code to its power of ten
# (named exp_map so it does not mask the base function exp()).
# Note: levels not listed here (the digits 1, 4, 5, 7, 8) become NA, not 0.
exp_map <- list("0" = c("-", "+", "?", "0", "", " "), "2" = c("2", "h", "H"), "3" = c("3", "k", "K"), "6" = c("6", "m", "M"), "9" = c("9", "b", "B"))

# Rename levels for exponent data
levels(storm_data$Property_Damage_Exp) <- exp_map
levels(storm_data$Crop_Damage_Exp) <- exp_map

# Convert exponent variables from Factor to Integer 
storm_data$Property_Damage_Exp <- as.numeric(as.character(storm_data$Property_Damage_Exp))
storm_data$Crop_Damage_Exp <- as.numeric(as.character(storm_data$Crop_Damage_Exp))

# Create new variables
storm_data <- mutate(storm_data, Property_Damage = Property_Damage * 10^Property_Damage_Exp)
storm_data <- mutate(storm_data, Crop_Damage = Crop_Damage * 10^Crop_Damage_Exp)

# Remove redundant exponent columns
storm_data <- select(storm_data, -c(5, 7))
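The recoding step relies on how R handles a named list assigned to the levels of a factor. Each list name becomes a new level that absorbs the old levels listed under it, and any old level not listed anywhere becomes NA. The toy example below illustrates this; `toy_exp` and `toy_map` are hypothetical names, not part of the analysis.

```r
# Sketch: how a named list passed to levels<- collapses factor levels.
# toy_exp and toy_map are illustrative names, not from the original analysis.
toy_exp <- factor(c("K", "k", "M", "", "?", "5"))
toy_map <- list("0" = c("", "?"), "3" = c("k", "K"), "6" = c("m", "M"))

levels(toy_exp) <- toy_map
toy_num <- as.numeric(as.character(toy_exp))

# "5" is not listed in toy_map, so it ends up as NA rather than 0
print(toy_num)
```

The result is 3, 3, 6, 0, 0, NA. In the real data the same thing happens to any exponent code that is missing from the mapping, which is why rows with NA property damage are removed later in the analysis.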

The prepared dataset we’ll work with now looks like this after tidying up:

print(storm_data)
## Source: local data frame [762,150 x 5]
## 
##    Event_Type Fatalities Injuries Property_Damage Crop_Damage
## 1     TORNADO          0        0         2500000           0
## 2     TORNADO          0        0          250000           0
## 3   TSTM WIND          0        0               0           0
## 4     TORNADO          0        0         2500000           0
## 5   TSTM WIND          0        0               0           0
## 6   TSTM WIND          0        0               0           0
## 7   TSTM WIND          0        0               0           0
## 8        HAIL          0        0               0           0
## 9   TSTM WIND          0        0               0           0
## 10  TSTM WIND          0        0               0           0
## ..        ...        ...      ...             ...         ...

Results

The data analysis must address the following questions:

  1. Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?

  2. Across the United States, which types of events have the greatest economic consequences?

The most deadly/harmful weather events in the US from 1989-2011

We’ll group the dataset by event type and summarize it to find out which weather events are the most harmful to human life.

by_event <- group_by(storm_data, Event_Type)
summary_inj_fat <- summarize(by_event, sum(Injuries), sum(Fatalities))
names(summary_inj_fat)[2:3] <- c("Injuries", "Fatalities")
summary_inj_fat <- mutate(summary_inj_fat, Total_injuries_fatalities = Injuries + Fatalities)
top_inj_fat <- arrange(summary_inj_fat, desc(Total_injuries_fatalities))
head(top_inj_fat, n = 15)
## Source: local data frame [15 x 4]
## 
##           Event_Type Injuries Fatalities Total_injuries_fatalities
## 1            TORNADO    27944       1802                     29746
## 2     EXCESSIVE HEAT     6525       1903                      8428
## 3              FLOOD     6789        470                      7259
## 4          LIGHTNING     5230        816                      6046
## 5          TSTM WIND     5404        356                      5760
## 6               HEAT     2100        937                      3037
## 7        FLASH FLOOD     1777        978                      2755
## 8          ICE STORM     1975         89                      2064
## 9  THUNDERSTORM WIND     1488        133                      1621
## 10      WINTER STORM     1321        206                      1527
## 11         HIGH WIND     1137        248                      1385
## 12 HURRICANE/TYPHOON     1275         64                      1339
## 13              HAIL     1162         15                      1177
## 14        HEAVY SNOW     1021        127                      1148
## 15          WILDFIRE      911         75                       986

This summary shows inconsistent naming in the Event_Type values, so we need to clean them up and merge related event names into umbrella categories. The list below was compiled from the more than 100 raw event types that are most dangerous to human lives and material property. After cleaning up, we’ll have 16 weather event categories to use in the analysis.

event_recode <- list(
    "Cold Weather/Snow" = c("ICE STORM", "WINTER STORM", "AVALANCHE", "HEAVY SNOW", "BLIZZARD", "WINTER WEATHER/MIX", "WINTRY MIX", "WINTER WEATHER MIX", "SNOW SQUALL", "SNOW/HIGH WINDS", "SNOW", "EXTREME WINDCHILL", "WINTER STORM HIGH WINDS", "BLOWING SNOW", "COLD AND SNOW", "WINTER WEATHER", "EXTREME COLD", "EXTREME COLD/WIND CHILL", "COLD/WIND CHILL", "COLD", "HEAVY SNOW/ICE", "HIGH WINDS/SNOW", "WINTER STORM"),
    "Fire" = c("WILDFIRE", "WILD/FOREST FIRE", "WILD FIRES", "HIGH WINDS/COLD"),
    "Flood" = c("FLOOD", "FLASH FLOOD", "URBAN/SML STREAM FLD", "FLOOD/FLASH FLOOD", "FLASH FLOOD/FLOOD", "WATERSPOUT", "FLASH FLOODING", "FLASH FLOOD/FLOOD", "FLOODING", "RIVER FLOOD", "COASTAL FLOOD", "River Flooding", "COASTAL FLOODING", "FLOOD/RAIN/WINDS", "MAJOR FLOOD"),
    "Fog" = c("FOG", "DENSE FOG"),
    "Hail" = c("HAIL", "TSTM WIND/HAIL", "SMALL HAIL", "HAILSTORM"),
    "Heat" = c("EXCESSIVE HEAT", "HEAT", "HEAT WAVE", "EXTREME HEAT", "Heat Wave", "RECORD HEAT", "DRY MICROBURST", "UNSEASONABLY WARM AND DRY", "UNSEASONABLY WARM", "HEAT WAVE DROUGHT", "RECORD/EXCESSIVE HEAT", "DROUGHT"),
    "Hurricane" = c("HURRICANE/TYPHOON", "HURRICANE", "HURRICANE OPAL", "HURRICANE ERIN", "HURRICANE OPAL/HIGH WINDS"),
    "Ice" = c("GLAZE", "ICE", "ICY ROADS", "FREEZING RAIN", "FREEZING DRIZZLE", "GLAZE/ICE STORM", "ICE STORM", "FROST/FREEZE", "FREEZE", "DAMAGING FREEZE", "FROST"),
    "Land Slide" = c("LANDSLIDE"),
    "Rain" = c("HEAVY RAIN", "HEAVY RAIN/SEVERE WEATHER, RAIN", "MIXED PRECIP", "EXCESSIVE RAINFALL", "EXCESSIVE WETNESS", "HEAVY RAINS"),
    "Rip Current" = c("RIP CURRENT", "RIP CURRENTS"), 
    "Thunderstorm" = c("TSTM WIND", "LIGHTNING", "STORM SURGE", "MARINE THUNDERSTORM WIND", "THUNDERSTORMW", "SEVERE THUNDERSTORM", "THUNDERSTORM WINDS", "MARINE TSTM WIND", "THUNDERSTORM", "THUNDERSTORM  WINDS", "THUNDERSTORM WIND"),
    "Tornado" = c("TORNADO", "WATERSPOUT/TORNADO", "TORNADOES, TSTM WIND, HAIL", "TORNADO F2"),
    "Tropical Storm" = c("TROPICAL STORM", "DUST STORM", "TROPICAL STORM GORDON", "DUST DEVIL"),
    "Tsunami/High Surf" = c("HIGH SURF", "TSUNAMI", "HEAVY SURF/HIGH SURF", "HEAVY SURF", "HIGH WIND AND SEAS", "STORM SURGE/TIDE", "HIGH SEAS", "ROUGH SEAS", "MARINE MISHAP", "High Surf", "HIGH SURF ADVISORY"),
    "Wind" = c("THUNDERSTORM WIND", "HIGH WIND", "STRONG WIND", "HIGH WINDS", "WIND", "MARINE STRONG WIND", "STRONG WINDS", "GUSTY WINDS"))

levels(top_inj_fat$Event_Type) <- event_recode
top_inj_fat <- group_by(top_inj_fat, Event_Type)
top_inj_fat <- summarize(top_inj_fat, sum(Injuries), sum(Fatalities), sum(Total_injuries_fatalities))
names(top_inj_fat)[2:4] <- c("Injuries", "Fatalities", "Total_casualties")
top_inj_fat <- arrange(top_inj_fat, desc(Total_casualties))
top_inj_fat[1:15, ]
## Source: local data frame [15 x 4]
## 
##           Event_Type Injuries Fatalities Total_casualties
## 1            Tornado    28002       1830            29832
## 2       Thunderstorm    11663       1269            12932
## 3               Heat     9273       3174            12447
## 4              Flood     8704       1541            10245
## 5  Cold Weather/Snow     4396       1178             5574
## 6               Wind     3344        567             3911
## 7                Ice     2415        118             2533
## 8               Fire     1610         90             1700
## 9          Hurricane     1323        134             1457
## 10              Hail     1267         20             1287
## 11               Fog     1076         80             1156
## 12       Rip Current      529        572             1101
## 13    Tropical Storm      865         90              955
## 14 Tsunami/High Surf      416        220              636
## 15              Rain      302        102              404
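The recode list above spells out each variant explicitly, including differences in case ("Heat Wave" and "HEAT WAVE") and spacing ("THUNDERSTORM  WINDS"). A complementary step, which the original analysis does not use, would be to normalize the labels first so that fewer variants need listing. The sketch below uses a toy vector `raw_labels` whose spellings are taken from the data shown earlier.

```r
# Sketch: normalise raw event labels before recoding.
# raw_labels is a toy vector; the spellings are taken from the data above.
raw_labels <- c("   HIGH SURF ADVISORY", "Heat Wave", "HEAT WAVE",
                "THUNDERSTORM  WINDS", "River Flooding")

clean_labels <- toupper(raw_labels)                  # one case only
clean_labels <- gsub("^ +| +$", "", clean_labels)    # drop leading/trailing spaces
clean_labels <- gsub(" +", " ", clean_labels)        # collapse repeated spaces

unique(clean_labels)
```

After normalization the five toy labels reduce to four distinct values, because the two spellings of "HEAT WAVE" coincide.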

Let’s plot the 10 weather event categories with the highest casualty counts in the US from 1989-2011.

# Melting the data for plotting
top_inj_fat <- transform(top_inj_fat, Event_Type = reorder(Event_Type, order(Total_casualties, decreasing = TRUE))) # reorder Event_Type factor levels from more casualties to less for proper plotting later on
top_inj_fat_melted <- melt(top_inj_fat[1:10 ,1:3], id = "Event_Type") # melt top-10 events
names(top_inj_fat_melted)[2:3] <- c("Casualty_Type", "Count")

ggplot(data = top_inj_fat_melted, aes(x = Event_Type, y = Count, fill = Casualty_Type)) + 
    geom_bar(stat = "identity") + 
    labs(title = "Most harmful weather events in the US from 1989-2011", x = "Weather event", y = "Number of persons injured/killed") + 
    scale_fill_manual(values=c("blue", "red")) + 
    theme(axis.text.x = element_text(angle = 30, hjust = 1, vjust = 1, face="bold"))

We can see that tornadoes hold first position as the most dangerous weather event in terms of combined injuries and fatalities. Thunderstorms, heat, floods, cold weather/snow and wind are also very dangerous in terms of both injuries and fatalities, but the ranking by fatalities alone is somewhat different, with heat causing the most fatalities.
The remaining weather events can be considered somewhat less dangerous, since their casualty counts are lower and include relatively few fatalities.

To better compare the weather events causing the most injuries with those causing the most fatalities, we’ll make a side-by-side plot.

top_fatalities <- top_inj_fat[, c(1,3)]
top_injuries <- top_inj_fat[, c(1,2)]
top_fatalities <- arrange(top_fatalities, desc(Fatalities))
top_injuries <- arrange(top_injuries, desc(Injuries))

top_fatalities <- transform(top_fatalities, Event_Type = reorder(Event_Type, order(Fatalities, decreasing = TRUE)))
top_injuries <- transform(top_injuries, Event_Type = reorder(Event_Type, order(Injuries, decreasing = TRUE)))

Inj_plot <- ggplot(top_injuries[1:8, ], aes(x = Event_Type, y = Injuries)) +
    geom_bar(stat = "identity", fill = "blue", color = "black") +
    labs(x = "Weather event type", y = "Number of injuries") +
    coord_cartesian(ylim = c(0, 30000), xlim=c(0, 9)) +
    theme(axis.text.x = element_text(angle = 30, hjust = 1, vjust = 1))

Fat_plot <- ggplot(top_fatalities[1:8, ], aes(x = Event_Type, y = Fatalities)) +
    geom_bar(stat = "identity", fill = "red", color = "black") +
    labs(x = "Weather event type", y = "Number of fatalities") +
    coord_cartesian(ylim = c(0, 30000), xlim=c(0, 9)) +
    theme(axis.text.x = element_text(angle = 30, hjust = 1, vjust = 1))

grid.arrange(arrangeGrob(Inj_plot, Fat_plot, ncol = 2, main = textGrob("Most harmful weather events causing injuries vs fatalities", vjust = 0.5, gp = gpar(fontface = "bold"))))

From this plot we can conclude that heat is indeed the deadliest weather event, followed by tornadoes, floods, thunderstorms and cold weather/snow.
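The fatality ranking can also be checked directly from the numbers. The sketch below re-enters the top five rows of the recoded casualty table by hand (the data frame name `casualties` is an assumption for illustration) and sorts them by fatalities instead of by total casualties.

```r
# Figures copied from the recoded casualty table above (top five by total)
casualties <- data.frame(
  Event_Type = c("Tornado", "Thunderstorm", "Heat", "Flood", "Cold Weather/Snow"),
  Injuries   = c(28002, 11663, 9273, 8704, 4396),
  Fatalities = c(1830, 1269, 3174, 1541, 1178),
  stringsAsFactors = FALSE
)

# Sort by fatalities only, largest first
by_fatalities <- casualties[order(-casualties$Fatalities), ]
by_fatalities$Event_Type
```

Sorted this way the order is Heat, Tornado, Flood, Thunderstorm, Cold Weather/Snow, which agrees with the plot.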

Weather events causing the highest damage to properties and crops in the US from 1989-2011

We’ll summarize the data to find out what types of severe weather events inflict the greatest damage to material property in the US.

# Clean up the dataset, use new categories for weather events, exclude data with NAs for Property_Damage
levels(by_event$Event_Type) <- event_recode
by_event <-  data.frame(by_event)
damage <- tbl_df(by_event[, c(1, 4, 5)])
Property_NAs <- complete.cases(damage$Property_Damage) 
damage <- damage[which(Property_NAs == TRUE), ] # remove Property_Damage NA's

# Assign all unspecified weather events to an "Other" category
# (add the new level first; indexing levels() by row position could overwrite an existing level)
levels(damage$Event_Type) <- c(levels(damage$Event_Type), "Other")
damage$Event_Type[which(is.na(damage$Event_Type))] <- "Other"
damage$Event_Type <- factor(damage$Event_Type)

# Summarize the data with greatest economic consequences
damage <- group_by(damage, Event_Type)
summary_damage <- summarize(damage, sum(Property_Damage), sum(Crop_Damage))
names(summary_damage)[2:3] <- c("Property_Damage", "Crop_Damage")
summary_damage <- mutate(summary_damage, Total_damage = Property_Damage + Crop_Damage)
summary_damage <- arrange(summary_damage, desc(Total_damage))
summary_damage <- transform(summary_damage, Event_Type = reorder(Event_Type, order(Total_damage, decreasing = TRUE))) # reorder Event_Type factor levels from more total damage to less for proper plotting later on

head(summary_damage, n = 15)
summary_damage <- summary_damage[, 1:3]
##           Event_Type Property_Damage Crop_Damage Total_damage
## 1              Flood    167423099313 12381239200 179804338513
## 2          Hurricane     84705105010  5514792800  90219897810
## 3       Thunderstorm     51750392734   758009228  52508401962
## 4            Tornado     33871276557   417453270  34288729827
## 5               Hail     16017673013  3086443723  19104116736
## 6               Heat      1073164350 14877054500  15950218850
## 7                Ice      3984720560  6890524500  10875245060
## 8               Wind      9548504585  1164588450  10713093035
## 9  Cold Weather/Snow      8467308091  1604290100  10071598191
## 10              Fire      8501543500   409269630   8910813130
## 11    Tropical Storm      7710639880   681946000   8392585880
## 12 Tsunami/High Surf      4886530500      870000   4887400500
## 13             Other      3670999916   336363980   4007363896
## 14              Rain       706533140   935899800   1642432940
## 15        Land Slide       324596000    20017000    344613000

We can see the top 15 weather events with the most negative impact on the economy in the above summary.

# Melting the data for plotting
summary_damage_melted <- melt(summary_damage[1:10, ], id = "Event_Type") # melt top-10 events
names(summary_damage_melted)[2:3] <- c("Damage_Type", "Value")
summary_damage_melted$Value <- round(summary_damage_melted$Value/1000000, 0)

ggplot(data = summary_damage_melted, aes(x = Event_Type, y = Value, fill = Damage_Type)) + 
    geom_bar(stat = "identity") + 
    labs(title = "Weather events that caused most damage in the US from 1989-2011", x = "Weather event", y = "Millions of Dollars") + 
    scale_fill_manual(values=c("blue", "green")) + 
    theme(axis.text.x = element_text(angle = 30, hjust = 1, vjust = 1, face="bold"))

From this plot we can conclude that floods are by far the most costly for the economy, followed by hurricanes, thunderstorms and tornadoes, which inflict the highest property damage. Note that a hurricane is not the same event as a tornado: each year the US sees 10-15 hurricanes and about 1200 tornadoes.

Regarding damage to crops, heat (a category that here includes drought) has the most negative effect, followed by floods, ice and hurricanes.
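To make the contrast between property and crop damage more concrete, the sketch below computes the share of each category's total damage that comes from crops. The figures are copied by hand from the damage summary above for four categories; `damage_toy` and `Crop_share_pct` are illustrative names.

```r
# Figures copied from the damage summary above (USD)
damage_toy <- data.frame(
  Event_Type      = c("Flood", "Hurricane", "Heat", "Ice"),
  Property_Damage = c(167423099313, 84705105010, 1073164350, 3984720560),
  Crop_Damage     = c(12381239200, 5514792800, 14877054500, 6890524500),
  stringsAsFactors = FALSE
)

# Share of each category's total damage that comes from crops, in percent
damage_toy$Crop_share_pct <- round(
  100 * damage_toy$Crop_Damage /
    (damage_toy$Property_Damage + damage_toy$Crop_Damage), 1)

# Rank by absolute crop damage, largest first
crop_rank <- damage_toy[order(-damage_toy$Crop_Damage),
                        c("Event_Type", "Crop_share_pct")]
crop_rank
```

Heat and Ice are dominated by crop losses (over 90% of the total for Heat), whereas for Flood and Hurricane crops make up well under a tenth of the total.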


This document was produced with RStudio v0.98.1091 on R v3.1.2.