Synopsis

In this analysis we used the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database, containing data and characteristics of major storms and weather events in the United States aiming to find:

  1. Across the United States, which types of events are most harmful with respect to population health?
  2. Across the United States, which types of events have the greatest economic consequences?

Data Processing

The data used in the analysis was download from the course web site: Storm Data (47Mb).

There is also some documentation of the database available at NOAA, where we can find how some of the variables are constructed/defined.

The events in the database start in the year 1950 and end in November 2011. In the earlier years of the database there are generally fewer events recorded, most likely due to a lack of good records. More recent years should be considered more complete.

Basic Setup

We use the following library to perfom our data manipulation also set some constants to be used in our dara processing.

## libraries
library(dplyr)
library(ggplot2)
library(reshape2)

## constants and env parameters
DATA_DIR <- "./data/"
DATA_FILE <- paste(DATA_DIR,"repdata-data-StormData.csv.bz2", sep="")
DATA_URL <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"

Data File

We download the data file and loads the raw data, directly from the compressed file, to be prepared to analysis.

## there is an data dir?
if(!file.exists(DATA_DIR)) dir.create(DATA_DIR)

## unzip and database load
if(!file.exists(DATA_FILE)) download.file(DATA_URL, DATA_FILE)

## load the data set (directly from zip file)
data <- read.csv(DATA_FILE)

## looking the size of datafile
dim(data)
## [1] 902297     37
colnames(data)
##  [1] "STATE__"    "BGN_DATE"   "BGN_TIME"   "TIME_ZONE"  "COUNTY"    
##  [6] "COUNTYNAME" "STATE"      "EVTYPE"     "BGN_RANGE"  "BGN_AZI"   
## [11] "BGN_LOCATI" "END_DATE"   "END_TIME"   "COUNTY_END" "COUNTYENDN"
## [16] "END_RANGE"  "END_AZI"    "END_LOCATI" "LENGTH"     "WIDTH"     
## [21] "F"          "MAG"        "FATALITIES" "INJURIES"   "PROPDMG"   
## [26] "PROPDMGEXP" "CROPDMG"    "CROPDMGEXP" "WFO"        "STATEOFFIC"
## [31] "ZONENAMES"  "LATITUDE"   "LONGITUDE"  "LATITUDE_E" "LONGITUDE_"
## [36] "REMARKS"    "REFNUM"

Not all the variables will be used in our study. To determine the economic impact and the harmfulness are necessary:

Variable Description
EVTYPE Indicates the type of weather event
FATALITIES Estimation of the number fatalities
INJURIES Estimation of the number of injuried people
PROPDMG and PROPDMGEXP Estimation of US$ loss caused by properties damages
CROPDMG and CROPDMGEXP Estimation of US$ loss caused by crop damages

Observarion: for US$ estimation, the datafile combines and value in the PROPDMG (or CROPDMG) to be multiply by 10PROPDMGEXP (OR 10CROPDMGEXP).

## subsetting
dt <- select(data, EVTYPE, FATALITIES, INJURIES, PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP)

Results

Weather Event Harmfulness

Question: Across the United States, which types of events are most harmful with respect to population health?

In our data, we’ll define HARMFULNESS as the sum of FATALITIES and INJURIES, and sum this data by EVTYPE (event type) cross the historic data.

HARMFUL = FATALITIES + INJURIES

## harmful defined as fatalities + injuries
dt$HARMFUL <- dt$FATALITIES + dt$INJURIES

## grouping em summarizing by event type
health_impact <- dt %>% 
      group_by(EVTYPE) %>% 
      summarise(
          fat = sum(FATALITIES,rm.na=TRUE), 
          inj=sum(INJURIES,rm.na=TRUE), 
          harm=sum(HARMFUL, rm.na=TRUE), n=n()
      ) %>%
      arrange(desc(harm),desc(fat), desc(inj))

The top 10 harmful weather event, in absolute number, where ‘n’ is the number of events registered, are:

head(health_impact,10)
## Source: local data frame [10 x 5]
## 
##               EVTYPE  fat   inj  harm      n
## 1            TORNADO 5634 91347 96980  60652
## 2     EXCESSIVE HEAT 1904  6526  8429   1678
## 3          TSTM WIND  505  6958  7462 219940
## 4              FLOOD  471  6790  7260  25326
## 5          LIGHTNING  817  5231  6047  15754
## 6               HEAT  938  2101  3038    767
## 7        FLASH FLOOD  979  1778  2756  54277
## 8          ICE STORM   90  1976  2065   2006
## 9  THUNDERSTORM WIND  134  1489  1622  82563
## 10      WINTER STORM  207  1322  1528  11433
## stacked version spliting FATALITIES and INJURIED
hi.m <- melt(select(head(health_impact,10), EVTYPE, fat, inj),id.vars="EVTYPE")
hi_chart <- ggplot(hi.m, aes(x=reorder(EVTYPE,value), y=value, fill=variable))
hi_chart <- hi_chart + geom_bar(stat="identity") + coord_flip() + theme_bw()
hi_chart <- hi_chart + ylab("Fatalities + Injuries") + xlab("Weather Events")
hi_chart <- hi_chart + ggtitle("Top 10 Harmful Weather Events")
hi_chart

Another way to see a harmfulness of a weather event is to see how much deadly (looking only at fatalities) and how dangerous it is, not see the absolute number of fatalities and injuries but the rate of this numbers

DANGER = (FATALITIES + INJURIES) / NUMBER OF OCCURRENCES

## seeing the top 10 mortal
head(arrange(health_impact, desc(fat),desc(inj),desc(harm)),10)
## Source: local data frame [10 x 5]
## 
##            EVTYPE  fat   inj  harm      n
## 1         TORNADO 5634 91347 96980  60652
## 2  EXCESSIVE HEAT 1904  6526  8429   1678
## 3     FLASH FLOOD  979  1778  2756  54277
## 4            HEAT  938  2101  3038    767
## 5       LIGHTNING  817  5231  6047  15754
## 6       TSTM WIND  505  6958  7462 219940
## 7           FLOOD  471  6790  7260  25326
## 8     RIP CURRENT  369   233   601    470
## 9       HIGH WIND  249  1138  1386  20212
## 10      AVALANCHE  225   171   395    386
## more dangerous type of event
health_impact$danger <- health_impact$harm/health_impact$n
head(arrange(health_impact, desc(danger),desc(harm),desc(fat),desc(inj)),10)
## Source: local data frame [10 x 6]
## 
##                        EVTYPE fat inj harm n danger
## 1                   Heat Wave   1  71   71 1   71.0
## 2       TROPICAL STORM GORDON   9  44   52 1   52.0
## 3                  WILD FIRES   4 151  154 4   38.5
## 4               THUNDERSTORMW   1  28   28 1   28.0
## 5  TORNADOES, TSTM WIND, HAIL  26   1   26 1   26.0
## 6          HIGH WIND AND SEAS   4  21   24 1   24.0
## 7           HEAT WAVE DROUGHT   5  16   20 1   20.0
## 8             SNOW/HIGH WINDS   1  37   37 2   18.5
## 9     WINTER STORM HIGH WINDS   2  16   17 1   17.0
## 10            GLAZE/ICE STORM   1  16   16 1   16.0

Economic Impact

Across the United States, which types of events have the greatest economic consequences?

Economic Consequences will be defined as total loss of properties and crop damages in dolars, or in our dataset as:

total_loss = PROPDMG * 10PROPDMGEXP + CROPDMG * 10CROPDMGEXP

The exponent information (PROPDMGEXP and CROPDMGEXP) of damage values are encoded as:

levels(dt$PROPDMGEXP)
##  [1] ""  "-" "?" "+" "0" "1" "2" "3" "4" "5" "6" "7" "8" "B" "h" "H" "K"
## [18] "m" "M"

So, it is necessary to convert the notation to a actual number of 10 powers. We used the conversion table found at https://rpubs.com/flyingdisc/PROPDMGEXP

Exp Notation Order Value
H,h hundreds = 100
K,k kilos = thousands = 1,000
M,m millions = 1,000,000
B,b billions = 1,000,000,000
(+) 1
(-) 0
(?) 0
black/empty character 0
numeric 0..8 10
## mapping factor to a ten power
exp_keys <- levels(dt$PROPDMGEXP)
exp_power <- c(0,0,0,0,1,1,1,1,1,1,1,1,1,9,2,2,3,6,6)

## seeing the map
exp_mapping <- data.frame(exp_keys,exp_power)
exp_mapping
##    exp_keys exp_power
## 1                   0
## 2         -         0
## 3         ?         0
## 4         +         0
## 5         0         1
## 6         1         1
## 7         2         1
## 8         3         1
## 9         4         1
## 10        5         1
## 11        6         1
## 12        7         1
## 13        8         1
## 14        B         9
## 15        h         2
## 16        H         2
## 17        K         3
## 18        m         6
## 19        M         6
## converting PROPDMGEXP
dt$PROPDMGEXP <- as.character(dt$PROPDMGEXP)
dt <- merge(dt,exp_mapping, by.x="PROPDMGEXP", by.y="exp_keys")
colnames(dt)[9]<-"PROPDMGEXP.n"

## converting CROPDMGEXP
dt$CROPDMGEXP <- as.character(dt$CROPDMGEXP)
dt <- merge(dt,exp_mapping, by.x="CROPDMGEXP", by.y="exp_keys")
colnames(dt)[10]<-"CROPDMGEXP.n"

## calculating total loss
dt$total_loss <- dt$PROPDMG * (10^dt$PROPDMGEXP.n) + dt$CROPDMG * (10^dt$CROPDMGEXP.n)

After convertion we can summarize the economic impact by event type and see the top 10.

## summarizing by event type
economic_impact <- dt %>% 
    group_by(EVTYPE) %>% 
    summarise(total_loss = sum(total_loss,rm.na=TRUE)) %>%
    arrange(desc(total_loss))

## seeing the top 10
head(economic_impact,10)
## Source: local data frame [10 x 2]
## 
##               EVTYPE   total_loss
## 1              FLOOD 150319678258
## 2  HURRICANE/TYPHOON  71913712801
## 3            TORNADO  57352117611
## 4        STORM SURGE  43323541001
## 5               HAIL  18757611588
## 6        FLASH FLOOD  17562132319
## 7            DROUGHT  15018672001
## 8          HURRICANE  14610229011
## 9        RIVER FLOOD  10148404501
## 10         ICE STORM   8967041811
## chart
ei_chart <- ggplot(
                head(economic_impact,10), 
                aes(x=reorder(EVTYPE,total_loss), y=total_loss/10^6))
ei_chart <- ei_chart + geom_bar(stat="identity") + coord_flip() + theme_bw()
ei_chart <- ei_chart + xlab("Weather Event") + ylab("Total Loss (millions USD)")
ei_chart <- ei_chart + ggtitle("Top 10 weather events cause economic loss")
ei_chart

Conclusion

As we see above, the Flood, Hurricane/Typhoon and Tornados are events that bring _more economic impact__, cross US, while Tornados, Excessive Heat, TST Wind and Floods are the events that cause more harm to population.