In this analysis we used the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database, containing data and characteristics of major storms and weather events in the United States aiming to find:
The data used in the analysis was download from the course web site: Storm Data (47Mb).
There is also some documentation of the database available at NOAA, where we can find how some of the variables are constructed/defined.
The events in the database start in the year 1950 and end in November 2011. In the earlier years of the database there are generally fewer events recorded, most likely due to a lack of good records. More recent years should be considered more complete.
We use the following library to perfom our data manipulation also set some constants to be used in our dara processing.
## libraries
library(dplyr)
library(ggplot2)
library(reshape2)
## constants and env parameters
DATA_DIR <- "./data/"
DATA_FILE <- paste(DATA_DIR,"repdata-data-StormData.csv.bz2", sep="")
DATA_URL <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
We download the data file and loads the raw data, directly from the compressed file, to be prepared to analysis.
## there is an data dir?
if(!file.exists(DATA_DIR)) dir.create(DATA_DIR)
## unzip and database load
if(!file.exists(DATA_FILE)) download.file(DATA_URL, DATA_FILE)
## load the data set (directly from zip file)
data <- read.csv(DATA_FILE)
## looking the size of datafile
dim(data)
## [1] 902297 37
colnames(data)
## [1] "STATE__" "BGN_DATE" "BGN_TIME" "TIME_ZONE" "COUNTY"
## [6] "COUNTYNAME" "STATE" "EVTYPE" "BGN_RANGE" "BGN_AZI"
## [11] "BGN_LOCATI" "END_DATE" "END_TIME" "COUNTY_END" "COUNTYENDN"
## [16] "END_RANGE" "END_AZI" "END_LOCATI" "LENGTH" "WIDTH"
## [21] "F" "MAG" "FATALITIES" "INJURIES" "PROPDMG"
## [26] "PROPDMGEXP" "CROPDMG" "CROPDMGEXP" "WFO" "STATEOFFIC"
## [31] "ZONENAMES" "LATITUDE" "LONGITUDE" "LATITUDE_E" "LONGITUDE_"
## [36] "REMARKS" "REFNUM"
Not all the variables will be used in our study. To determine the economic impact and the harmfulness are necessary:
| Variable | Description |
|---|---|
| EVTYPE | Indicates the type of weather event |
| FATALITIES | Estimation of the number fatalities |
| INJURIES | Estimation of the number of injuried people |
| PROPDMG and PROPDMGEXP | Estimation of US$ loss caused by properties damages |
| CROPDMG and CROPDMGEXP | Estimation of US$ loss caused by crop damages |
Observarion: for US$ estimation, the datafile combines and value in the PROPDMG (or CROPDMG) to be multiply by 10PROPDMGEXP (OR 10CROPDMGEXP).
## subsetting
dt <- select(data, EVTYPE, FATALITIES, INJURIES, PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP)
Question: Across the United States, which types of events are most harmful with respect to population health?
In our data, we’ll define HARMFULNESS as the sum of FATALITIES and INJURIES, and sum this data by EVTYPE (event type) cross the historic data.
HARMFUL = FATALITIES + INJURIES
## harmful defined as fatalities + injuries
dt$HARMFUL <- dt$FATALITIES + dt$INJURIES
## grouping em summarizing by event type
health_impact <- dt %>%
group_by(EVTYPE) %>%
summarise(
fat = sum(FATALITIES,rm.na=TRUE),
inj=sum(INJURIES,rm.na=TRUE),
harm=sum(HARMFUL, rm.na=TRUE), n=n()
) %>%
arrange(desc(harm),desc(fat), desc(inj))
The top 10 harmful weather event, in absolute number, where ‘n’ is the number of events registered, are:
head(health_impact,10)
## Source: local data frame [10 x 5]
##
## EVTYPE fat inj harm n
## 1 TORNADO 5634 91347 96980 60652
## 2 EXCESSIVE HEAT 1904 6526 8429 1678
## 3 TSTM WIND 505 6958 7462 219940
## 4 FLOOD 471 6790 7260 25326
## 5 LIGHTNING 817 5231 6047 15754
## 6 HEAT 938 2101 3038 767
## 7 FLASH FLOOD 979 1778 2756 54277
## 8 ICE STORM 90 1976 2065 2006
## 9 THUNDERSTORM WIND 134 1489 1622 82563
## 10 WINTER STORM 207 1322 1528 11433
## stacked version spliting FATALITIES and INJURIED
hi.m <- melt(select(head(health_impact,10), EVTYPE, fat, inj),id.vars="EVTYPE")
hi_chart <- ggplot(hi.m, aes(x=reorder(EVTYPE,value), y=value, fill=variable))
hi_chart <- hi_chart + geom_bar(stat="identity") + coord_flip() + theme_bw()
hi_chart <- hi_chart + ylab("Fatalities + Injuries") + xlab("Weather Events")
hi_chart <- hi_chart + ggtitle("Top 10 Harmful Weather Events")
hi_chart
Another way to see a harmfulness of a weather event is to see how much deadly (looking only at fatalities) and how dangerous it is, not see the absolute number of fatalities and injuries but the rate of this numbers
DANGER = (FATALITIES + INJURIES) / NUMBER OF OCCURRENCES
## seeing the top 10 mortal
head(arrange(health_impact, desc(fat),desc(inj),desc(harm)),10)
## Source: local data frame [10 x 5]
##
## EVTYPE fat inj harm n
## 1 TORNADO 5634 91347 96980 60652
## 2 EXCESSIVE HEAT 1904 6526 8429 1678
## 3 FLASH FLOOD 979 1778 2756 54277
## 4 HEAT 938 2101 3038 767
## 5 LIGHTNING 817 5231 6047 15754
## 6 TSTM WIND 505 6958 7462 219940
## 7 FLOOD 471 6790 7260 25326
## 8 RIP CURRENT 369 233 601 470
## 9 HIGH WIND 249 1138 1386 20212
## 10 AVALANCHE 225 171 395 386
## more dangerous type of event
health_impact$danger <- health_impact$harm/health_impact$n
head(arrange(health_impact, desc(danger),desc(harm),desc(fat),desc(inj)),10)
## Source: local data frame [10 x 6]
##
## EVTYPE fat inj harm n danger
## 1 Heat Wave 1 71 71 1 71.0
## 2 TROPICAL STORM GORDON 9 44 52 1 52.0
## 3 WILD FIRES 4 151 154 4 38.5
## 4 THUNDERSTORMW 1 28 28 1 28.0
## 5 TORNADOES, TSTM WIND, HAIL 26 1 26 1 26.0
## 6 HIGH WIND AND SEAS 4 21 24 1 24.0
## 7 HEAT WAVE DROUGHT 5 16 20 1 20.0
## 8 SNOW/HIGH WINDS 1 37 37 2 18.5
## 9 WINTER STORM HIGH WINDS 2 16 17 1 17.0
## 10 GLAZE/ICE STORM 1 16 16 1 16.0
Across the United States, which types of events have the greatest economic consequences?
Economic Consequences will be defined as total loss of properties and crop damages in dolars, or in our dataset as:
total_loss = PROPDMG * 10PROPDMGEXP + CROPDMG * 10CROPDMGEXP
The exponent information (PROPDMGEXP and CROPDMGEXP) of damage values are encoded as:
levels(dt$PROPDMGEXP)
## [1] "" "-" "?" "+" "0" "1" "2" "3" "4" "5" "6" "7" "8" "B" "h" "H" "K"
## [18] "m" "M"
So, it is necessary to convert the notation to a actual number of 10 powers. We used the conversion table found at https://rpubs.com/flyingdisc/PROPDMGEXP
| Exp Notation | Order Value |
|---|---|
| H,h | hundreds = 100 |
| K,k | kilos = thousands = 1,000 |
| M,m | millions = 1,000,000 |
| B,b | billions = 1,000,000,000 |
| (+) | 1 |
| (-) | 0 |
| (?) | 0 |
| black/empty character | 0 |
| numeric 0..8 | 10 |
## mapping factor to a ten power
exp_keys <- levels(dt$PROPDMGEXP)
exp_power <- c(0,0,0,0,1,1,1,1,1,1,1,1,1,9,2,2,3,6,6)
## seeing the map
exp_mapping <- data.frame(exp_keys,exp_power)
exp_mapping
## exp_keys exp_power
## 1 0
## 2 - 0
## 3 ? 0
## 4 + 0
## 5 0 1
## 6 1 1
## 7 2 1
## 8 3 1
## 9 4 1
## 10 5 1
## 11 6 1
## 12 7 1
## 13 8 1
## 14 B 9
## 15 h 2
## 16 H 2
## 17 K 3
## 18 m 6
## 19 M 6
## converting PROPDMGEXP
dt$PROPDMGEXP <- as.character(dt$PROPDMGEXP)
dt <- merge(dt,exp_mapping, by.x="PROPDMGEXP", by.y="exp_keys")
colnames(dt)[9]<-"PROPDMGEXP.n"
## converting CROPDMGEXP
dt$CROPDMGEXP <- as.character(dt$CROPDMGEXP)
dt <- merge(dt,exp_mapping, by.x="CROPDMGEXP", by.y="exp_keys")
colnames(dt)[10]<-"CROPDMGEXP.n"
## calculating total loss
dt$total_loss <- dt$PROPDMG * (10^dt$PROPDMGEXP.n) + dt$CROPDMG * (10^dt$CROPDMGEXP.n)
After convertion we can summarize the economic impact by event type and see the top 10.
## summarizing by event type
economic_impact <- dt %>%
group_by(EVTYPE) %>%
summarise(total_loss = sum(total_loss,rm.na=TRUE)) %>%
arrange(desc(total_loss))
## seeing the top 10
head(economic_impact,10)
## Source: local data frame [10 x 2]
##
## EVTYPE total_loss
## 1 FLOOD 150319678258
## 2 HURRICANE/TYPHOON 71913712801
## 3 TORNADO 57352117611
## 4 STORM SURGE 43323541001
## 5 HAIL 18757611588
## 6 FLASH FLOOD 17562132319
## 7 DROUGHT 15018672001
## 8 HURRICANE 14610229011
## 9 RIVER FLOOD 10148404501
## 10 ICE STORM 8967041811
## chart
ei_chart <- ggplot(
head(economic_impact,10),
aes(x=reorder(EVTYPE,total_loss), y=total_loss/10^6))
ei_chart <- ei_chart + geom_bar(stat="identity") + coord_flip() + theme_bw()
ei_chart <- ei_chart + xlab("Weather Event") + ylab("Total Loss (millions USD)")
ei_chart <- ei_chart + ggtitle("Top 10 weather events cause economic loss")
ei_chart
As we see above, the Flood, Hurricane/Typhoon and Tornados are events that bring _more economic impact__, cross US, while Tornados, Excessive Heat, TST Wind and Floods are the events that cause more harm to population.