Weather Storm Types and their Impacts on Public Health and the Economy in the U.S.

Synopsis

In this report, I aim to identify which types of weather event are most harmful to population health, and which have the gravest economic consequences. My overall hypothesis is that some types of event have a higher impact than others, for each of the two types of consequence studied. To investigate this hypothesis, I obtained the storm dataset from the U.S. National Oceanic and Atmospheric Administration (NOAA), which has data from 1950 to 2011. From these data, I found that, in total across the U.S., tornadoes have the largest public health impact (both fatalities and injuries) by a large margin, and floods have the highest economic impact.

Research Questions

A quick exploratory analysis of NOAA’s Storm Database was carried out to identify:

  1. Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?
  2. Across the United States, which types of events have the greatest economic consequences?

Data Processing

  1. Loading libraries
# Loading Libraries
library(dplyr)
library(ggplot2)
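  1. The dataset was programmatically downloaded from the course website.
fileUrl <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
download.file(fileUrl, destfile = "noaa.csv", method = "curl")
  1. The dataset was imported into the project's global environment as data.
# read.csv() reads bzip2-compressed files directly, so no manual decompression is needed
# stringsAsFactors = TRUE keeps EVTYPE as a factor (required by levels() further down on R >= 4.0)
data <- read.csv(file = "noaa.csv", header = TRUE, stringsAsFactors = TRUE)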
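Re-running the report repeats the download each time. A minimal sketch of a cached download follows; `fetch_once` is a hypothetical helper name, not part of the original analysis, and the demonstration uses a temporary file and a placeholder URL so that no network access is needed.

```r
# Hypothetical helper (not in the original report): download only when the
# destination file is missing, so repeated runs reuse the local copy.
fetch_once <- function(url, destfile, ...) {
  if (!file.exists(destfile)) {
    download.file(url, destfile = destfile, mode = "wb", ...)
  }
  invisible(destfile)
}

# Demonstration with a file that already exists: no download is attempted,
# so the placeholder URL is never contacted.
existing <- tempfile(fileext = ".csv.bz2")
file.create(existing)
result <- fetch_once("https://example.invalid/unused.csv.bz2", existing)
```

In this report the equivalent call would be `fetch_once(fileUrl, "noaa.csv", method = "curl")`.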
  1. A copy of data was created for the analysis, so that the original version stays available unchanged.
    • Dataset used for preprocessing: df (copy of data).
df <- data
  1. Dimensions of the dataset
 dim(df)
## [1] 902297     37
  1. Variables
 names(df)
##  [1] "STATE__"    "BGN_DATE"   "BGN_TIME"   "TIME_ZONE"  "COUNTY"    
##  [6] "COUNTYNAME" "STATE"      "EVTYPE"     "BGN_RANGE"  "BGN_AZI"   
## [11] "BGN_LOCATI" "END_DATE"   "END_TIME"   "COUNTY_END" "COUNTYENDN"
## [16] "END_RANGE"  "END_AZI"    "END_LOCATI" "LENGTH"     "WIDTH"     
## [21] "F"          "MAG"        "FATALITIES" "INJURIES"   "PROPDMG"   
## [26] "PROPDMGEXP" "CROPDMG"    "CROPDMGEXP" "WFO"        "STATEOFFIC"
## [31] "ZONENAMES"  "LATITUDE"   "LONGITUDE"  "LATITUDE_E" "LONGITUDE_"
## [36] "REMARKS"    "REFNUM"
  1. Only the necessary variables were kept
df <- df %>%
  select(EVTYPE, FATALITIES, INJURIES, PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP)

According to the National Weather Service Storm Data Documentation:

VARIABLE     DESCRIPTION
EVTYPE       Type of storm event
FATALITIES   Number of people killed directly
INJURIES     Number of people injured directly
PROPDMG      Property damage
PROPDMGEXP   Unit code for PROPDMG: Hundred (H), Thousand (K), Million (M), Billion (B)
CROPDMG      Crop damage
CROPDMGEXP   Unit code for CROPDMG: Hundred (H), Thousand (K), Million (M), Billion (B)

The values of property damage are not all in the same unit (see table of variables above).

DAMAGE CODE    VALUE
B              1,000,000,000
M              1,000,000
K              1,000
H              100
NA or BLANK    1
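The table above maps each damage code to a dollar multiplier. A minimal sketch of that mapping as a named lookup vector follows, using toy values rather than the NOAA data; the names `multipliers` and `to_dollars` are illustrative and do not appear in the report. A lookup of this kind keeps every row, and any code outside the table comes back as NA so it can be inspected rather than silently lost.

```r
# Multipliers taken from the damage-code table above
multipliers <- c(B = 1e9, M = 1e6, K = 1e3, H = 1e2)

# Illustrative helper: convert a damage figure and its code to dollars
to_dollars <- function(amount, code) {
  code <- as.character(code)
  mult <- unname(multipliers[code])      # NA for any code not in the table
  mult[is.na(code) | code == ""] <- 1    # blank or missing code: multiplier of 1
  amount * mult
}

# Toy input: 25 K, 2.5 M, a blank code, a missing code, and an unknown code
toy_amount <- c(25, 2.5, 3, 7, 4)
toy_code   <- c("K", "M", "", NA, "?")
toy_dollars <- to_dollars(toy_amount, toy_code)
toy_dollars
```

The last element is NA because "?" is not one of the documented codes; how such rows should be valued is a decision the analyst still has to make.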
  1. (Economic aspects) Convert property and crop damage to a single unit (dollars), then calculate economic damage per event type.
dfE <- df # create copy before transformations
# table of equivalency: damage code -> dollar multiplier
tequiv <- data.frame(code  = c("B", "M", "K", "H", "", NA),
                     value = c(1000000000, 1000000, 1000, 100, 1, 1))
# attach the property multiplier (named value.x after the second merge)
# note: merge() is an inner join by default, so rows whose code is not in tequiv are dropped
dfE1 <- merge(dfE, tequiv, by.x = "PROPDMGEXP", by.y = "code")
# attach the crop multiplier (named value.y)
dfE1 <- merge(dfE1, tequiv, by.x = "CROPDMGEXP", by.y = "code")
dfE1$property.damage <- dfE1$PROPDMG * dfE1$value.x # calculate property damage value in dollars
dfE1$crop.damage <- dfE1$CROPDMG * dfE1$value.y # same for crop damage
  1. (Back to health aspects) A brief visual inspection of EVTYPE, the variable that describes the type of event, reveals that many records carry a description beginning with "Summary", which shows that they are not events per se. A few examples are shown below:
levels(df$EVTYPE)[721:723]
## [1] "Summary of March 23"    "Summary of March 24"   
## [3] "SUMMARY OF MARCH 24-25"
  1. Such entries (described in the previous point) were removed from the dataset (df).
removeRows <- grep("^summary", df$EVTYPE, ignore.case = TRUE)
df <- df[-removeRows,]
# confirm that the number of rows removed is as intended: the expression should evaluate to TRUE
nrow(data) - nrow(df) == length(removeRows)
## [1] TRUE

75 rows (events) were removed from the dataset.
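The filter above can be tried on a handful of labels before it is applied to the full dataset. The sketch below uses made-up rows. It also switches to logical indexing, which is an assumption of this example rather than the report's method: negative indexing with an empty match vector, as in `df[-integer(0), ]`, returns zero rows instead of leaving the data untouched.

```r
# Made-up rows for illustration only
toy <- data.frame(
  EVTYPE = c("TORNADO", "Summary of March 23", "SUMMARY OF MARCH 24-25", "HAIL"),
  FATALITIES = c(1, 0, 0, 0),
  stringsAsFactors = FALSE
)

# TRUE for labels that start with "summary", in any letter case
is_summary <- grepl("^summary", toy$EVTYPE, ignore.case = TRUE)

# Logical indexing is safe even when nothing matches
cleaned <- toy[!is_summary, ]

# Same sanity check as in the report: rows removed equals rows matched
nrow(toy) - nrow(cleaned) == sum(is_summary)
```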

  1. Dollar values are stored in different units (see variable descriptions above), as illustrated below:
df[1:3,]
##    EVTYPE FATALITIES INJURIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP
## 1 TORNADO          0       15    25.0          K       0           
## 2 TORNADO          0        0     2.5          K       0           
## 3 TORNADO          0        2    25.0          K       0

Which types of events are most harmful to population health?

Calculation of the total number of fatalities and injuries.

totalFatalities <- sum(df$FATALITIES)
totalFatalities
## [1] 15145
totalInjuries <- sum(df$INJURIES)
totalInjuries
## [1] 140528

There are over 800 storm types. However, histograms of the per-type totals (code below) show that the vast majority of fatalities and injuries are caused by a few types.

# fatalities and injuries are the per-type summaries built further down; run these lines after those steps
hist(fatalities$Total)
hist(injuries$Total)

Furthermore, there is a strong association between fatalities and injuries for all storm types.

df1 <- df %>%
  group_by(EVTYPE) %>%
  summarize(fatalities = sum(FATALITIES), injuries = sum(INJURIES))
correlation1 <- cor(df1$fatalities, df1$injuries)
correlation1
## [1] 0.9438477

The correlation between fatalities and injuries per event type is 0.94, as calculated above.

Identify the top 10 causes of fatalities, and sort them in descending order.

fatalities <- df %>%
  group_by(EVTYPE) %>%
  summarize(Total = sum(FATALITIES), Percentage = (Total * 100 / totalFatalities)) %>%
  arrange(desc(Percentage))
head(fatalities, 10)
## Source: local data frame [10 x 3]
## 
##            EVTYPE Total Percentage
## 1         TORNADO  5633  37.193793
## 2  EXCESSIVE HEAT  1903  12.565203
## 3     FLASH FLOOD   978   6.457577
## 4            HEAT   937   6.186860
## 5       LIGHTNING   816   5.387917
## 6       TSTM WIND   504   3.327831
## 7           FLOOD   470   3.103334
## 8     RIP CURRENT   368   2.429845
## 9       HIGH WIND   248   1.637504
## 10      AVALANCHE   224   1.479036
fatalities_10 <- fatalities[1:10,]
percentageTop10Fat <- sum(fatalities_10$Percentage)
percentageTop10Fat
## [1] 79.7689
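The percentages printed above can be re-derived by hand from the table, which is a useful check on the pipeline. The sketch below copies the ten totals and the overall fatality count from the output shown in this report; the variable names are illustrative.

```r
# Totals copied from the top-10 fatalities table printed above
top10 <- c("TORNADO" = 5633, "EXCESSIVE HEAT" = 1903, "FLASH FLOOD" = 978,
           "HEAT" = 937, "LIGHTNING" = 816, "TSTM WIND" = 504, "FLOOD" = 470,
           "RIP CURRENT" = 368, "HIGH WIND" = 248, "AVALANCHE" = 224)

# Overall number of fatalities, as reported for totalFatalities
total_fatalities <- 15145

# Share of each type, and combined share of the ten types, in percent
share_each  <- 100 * top10 / total_fatalities
share_top10 <- sum(share_each)

round(share_each[["TORNADO"]], 2)
round(share_top10, 2)
```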

The top 10 storm types account for 80% of fatalities. Tornado is the type of event responsible for the highest number of fatalities, as documented above. Tornado is also the event type responsible for the highest number of injuries, as can be seen below:

Identify the top 10 causes of injuries, and sort them in descending order.

injuries <- df %>%
  group_by(EVTYPE) %>%
  summarize(Total = sum(INJURIES), Percentage = (Total * 100 / totalInjuries)) %>%
  arrange(desc(Percentage))
head(injuries, 10)
## Source: local data frame [10 x 3]
## 
##               EVTYPE Total Percentage
## 1            TORNADO 91346 65.0019925
## 2          TSTM WIND  6957  4.9506148
## 3              FLOOD  6789  4.8310657
## 4     EXCESSIVE HEAT  6525  4.6432028
## 5          LIGHTNING  5230  3.7216782
## 6               HEAT  2100  1.4943641
## 7          ICE STORM  1975  1.4054139
## 8        FLASH FLOOD  1777  1.2645167
## 9  THUNDERSTORM WIND  1488  1.0588637
## 10              HAIL  1361  0.9684903
injuries_10 <- injuries[1:10,]
percentageTop10Inj <- sum(injuries_10$Percentage)
percentageTop10Inj
## [1] 89.3402

As calculated above, 89% of injuries are caused by the top 10 storm types.
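The injuries table lists TSTM WIND and THUNDERSTORM WIND as separate types, which look like two labels for the same phenomenon. The report does not merge them. The sketch below shows one way such labels could be consolidated, using three rows copied from that table; `normalise_evtype` is a hypothetical helper, and the rule that maps a leading "TSTM" to "THUNDERSTORM" is an assumption, not something taken from the NOAA documentation.

```r
# Three rows copied from the injuries table above
injuries_toy <- data.frame(
  EVTYPE = c("TSTM WIND", "THUNDERSTORM WIND", "FLOOD"),
  Total  = c(6957, 1488, 6789),
  stringsAsFactors = FALSE
)

# Hypothetical helper: upper-case, trim whitespace, expand the TSTM abbreviation
normalise_evtype <- function(x) {
  x <- toupper(trimws(as.character(x)))
  gsub("^TSTM", "THUNDERSTORM", x)
}

injuries_toy$EVTYPE <- normalise_evtype(injuries_toy$EVTYPE)

# Re-aggregate after normalisation and sort by total, largest first
merged <- aggregate(Total ~ EVTYPE, data = injuries_toy, FUN = sum)
merged <- merged[order(-merged$Total), ]
merged
```

With the two labels combined, the thunderstorm wind total for these rows is 6957 + 1488 = 8445 injuries. The ranking below tornado could shift if this kind of cleaning were applied to the whole dataset.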

Which types of events have the greatest economic consequences?

total.property.damage <- sum(dfE1$property.damage)
total.crop.damage <- sum(dfE1$crop.damage)

economic <- dfE1 %>%
  group_by(EVTYPE) %>%
  summarize(totalProperty = sum(property.damage),
            percProperty = (totalProperty * 100 / total.property.damage),
            totalCrop = sum(crop.damage),
            percCrop = (totalCrop * 100 / total.crop.damage),
            overall.damage = (totalProperty + totalCrop))

Both damage variables are strongly right-skewed (histograms not printed in the report, but the code is available below).

hist(dfE1$property.damage)
hist(dfE1$crop.damage)
correlation2 <- cor(economic$totalProperty, economic$totalCrop)
correlation2
## [1] 0.3784556

Their correlation is modest at 0.38, suggesting that a given event type can have quite different impacts on crops and on property.
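Because both totals are strongly right-skewed, the default Pearson coefficient can be dominated by a few extreme event types. As a hedged cross-check that the report itself does not perform, a rank-based (Spearman) coefficient can be computed with the same `cor()` function. The values below are toy numbers chosen for illustration, not NOAA data.

```r
# Toy per-type totals: one extreme property value, as in skewed damage data
toy_property <- c(1, 2, 3, 4, 1000)
toy_crop     <- c(2, 1, 4, 3, 5)

# Pearson uses the raw values; Spearman uses only their ranks
pearson  <- cor(toy_property, toy_crop)
spearman <- cor(toy_property, toy_crop, method = "spearman")

c(pearson = pearson, spearman = spearman)
```

In the report, the equivalent call would be `cor(economic$totalProperty, economic$totalCrop, method = "spearman")`.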

property <- economic %>%
  select(EVTYPE, totalProperty, percProperty) %>%
  arrange(desc(totalProperty))
property_10 <- property[1:10,]
top_tenProperty <- sum(property_10$percProperty)
top_tenProperty
## [1] 88.37599
crop <- economic %>%
  select(EVTYPE, totalCrop, percCrop) %>%
  arrange(desc(totalCrop))
crop_10 <- crop[1:10,]
top_tenCrop <- sum(crop_10$percCrop)
top_tenCrop
## [1] 85.36471

Results

Top-ten Storm Types With the Highest Impact on Public Health

ggplot(data = fatalities_10, aes(x = reorder(EVTYPE, Total), y = Total)) +
         geom_bar(stat = "identity", fill = "#333333") +
         xlab("") +
         ylab("Total Fatalities") +
         coord_flip() +
         ggtitle("Storm Types That Cause Highest Number of Fatalities")

The chart above illustrates the impact of each of the top-ten storm types which cause the highest number of fatalities.

ggplot(data = injuries_10, aes(x = reorder(EVTYPE, Total), y = Total)) +
         geom_bar(stat = "identity", fill = "#468499") +
         xlab("") +
         ylab("Total Injuries") +
         coord_flip() +
         ggtitle("Storm Types That Cause Highest Number of Injuries")

The chart above illustrates the impact of each of the top-ten storm types which cause the highest number of injuries.

economicTotal <- economic %>%
  arrange(desc(overall.damage))

economicTotal <- economicTotal[1:10,]
economicTotal$overall.damage.M <- (economicTotal$overall.damage / 1000000) # overall damage in millions of Dollars

ggplot(data = economicTotal, aes(x =  reorder(EVTYPE, overall.damage.M), y = overall.damage.M)) +
  geom_bar(stat = "identity", fill = "#ff4444") +
  coord_flip() +
  ylab("Overall Damage (millions of U.S. Dollars)") +
  xlab("") +
  ggtitle("Economic Impact of Top-ten Storm Event Types in the U.S.")

The chart above shows the top-ten storm types with the greatest economic impact (property + crops) in the U.S.

Note
Report produced for the specialization in Data Science, Johns Hopkins University.
Joao Pinelo Silva
April 2016

Assignment

The basic goal of this assignment is to explore the NOAA Storm Database and answer some basic questions about severe weather events. You must use the database to answer the questions below and show the code for your entire analysis. Your analysis can consist of tables, figures, or other summaries. You may use any R package you want to support your analysis. Source: Assignment brief. Limit: 3 figures.