1. Assignment

The basic goal of this assignment is to explore the NOAA Storm Database and answer some basic questions about severe weather events. You must use the database to answer the questions below and show the code for your entire analysis. Your analysis can consist of tables, figures, or other summaries. You may use any R package you want to support your analysis.

2. Synopsis

Storms and other severe weather events can cause both public health and economic problems for communities and municipalities. Many severe events can result in fatalities, injuries, and property damage, and preventing such outcomes to the extent possible is a key concern.

This report explores the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database and answer the following questions:

  1. Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?

  2. Across the United States, which types of events have the greatest economic consequences?

This report shows that across the United States the tornado is most harmful for population health, and the floods results in greatest economic loss.

3. Data Processing

Let’s download the raw data

if (!file.exists("repdata-data-StormData.csv.bz2")) 
{
  URL <- "http://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
  download.file(URL, destfile = "repdata-data-StormData.csv.bz2", method="curl", mode="wb")
}
raw.data <- read.table("repdata-data-StormData.csv.bz2", header = TRUE, sep = ",", stringsAsFactors = FALSE, na.strings = "NA")

head(raw.data)
##   STATE__           BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME STATE
## 1       1  4/18/1950 0:00:00     0130       CST     97     MOBILE    AL
## 2       1  4/18/1950 0:00:00     0145       CST      3    BALDWIN    AL
## 3       1  2/20/1951 0:00:00     1600       CST     57    FAYETTE    AL
## 4       1   6/8/1951 0:00:00     0900       CST     89    MADISON    AL
## 5       1 11/15/1951 0:00:00     1500       CST     43    CULLMAN    AL
## 6       1 11/15/1951 0:00:00     2000       CST     77 LAUDERDALE    AL
##    EVTYPE BGN_RANGE BGN_AZI BGN_LOCATI END_DATE END_TIME COUNTY_END
## 1 TORNADO         0                                               0
## 2 TORNADO         0                                               0
## 3 TORNADO         0                                               0
## 4 TORNADO         0                                               0
## 5 TORNADO         0                                               0
## 6 TORNADO         0                                               0
##   COUNTYENDN END_RANGE END_AZI END_LOCATI LENGTH WIDTH F MAG FATALITIES
## 1         NA         0                      14.0   100 3   0          0
## 2         NA         0                       2.0   150 2   0          0
## 3         NA         0                       0.1   123 2   0          0
## 4         NA         0                       0.0   100 2   0          0
## 5         NA         0                       0.0   150 2   0          0
## 6         NA         0                       1.5   177 2   0          0
##   INJURIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP WFO STATEOFFIC ZONENAMES
## 1       15    25.0          K       0                                    
## 2        0     2.5          K       0                                    
## 3        2    25.0          K       0                                    
## 4        2     2.5          K       0                                    
## 5        2     2.5          K       0                                    
## 6        6     2.5          K       0                                    
##   LATITUDE LONGITUDE LATITUDE_E LONGITUDE_ REMARKS REFNUM
## 1     3040      8812       3051       8806              1
## 2     3042      8755          0          0              2
## 3     3340      8742          0          0              3
## 4     3458      8626          0          0              4
## 5     3412      8642          0          0              5
## 6     3450      8748          0          0              6

Let’s check to see FEW event types:

length(unique(raw.data$EVTYPE))
## [1] 985
head(unique(raw.data$EVTYPE), 20)
##  [1] "TORNADO"                   "TSTM WIND"                
##  [3] "HAIL"                      "FREEZING RAIN"            
##  [5] "SNOW"                      "ICE STORM/FLASH FLOOD"    
##  [7] "SNOW/ICE"                  "WINTER STORM"             
##  [9] "HURRICANE OPAL/HIGH WINDS" "THUNDERSTORM WINDS"       
## [11] "RECORD COLD"               "HURRICANE ERIN"           
## [13] "HURRICANE OPAL"            "HEAVY RAIN"               
## [15] "LIGHTNING"                 "THUNDERSTORM WIND"        
## [17] "DENSE FOG"                 "RIP CURRENT"              
## [19] "THUNDERSTORM WINS"         "FLASH FLOOD"

There appears to be duplications. Let’s try to clean and combine some of those.

raw.data$EVTYPE <- gsub("^(HEAT).*", "HEAT", raw.data$EVTYPE)
raw.data$EVTYPE <- gsub("^(RECORD HEAT).*", "HEAT", raw.data$EVTYPE)
raw.data$EVTYPE <- gsub("^(EXTREME HEAT).*", "HEAT", raw.data$EVTYPE)
raw.data$EVTYPE <- gsub("^(Heat).*", "HEAT", raw.data$EVTYPE)
raw.data$EVTYPE <- gsub("^(EXCESSIVE HEAT).*", "HEAT", raw.data$EVTYPE)
raw.data$EVTYPE <- gsub("^(TSTM).*", "THUNDER STORM", raw.data$EVTYPE)
raw.data$EVTYPE <- gsub("^(THUNDERSTORM).*", "THUNDER STORM", raw.data$EVTYPE)
raw.data$EVTYPE <- gsub("^(TROPICAL STORM).*", "TROPICAL STORM", raw.data$EVTYPE)
raw.data$EVTYPE <- gsub("^(FLASH FLOOD).*", "FLOOD", raw.data$EVTYPE)
raw.data$EVTYPE <- gsub("^(WIND).*", "WIND", raw.data$EVTYPE)
raw.data$EVTYPE <- gsub("^(STRONG WIND).*", "WIND", raw.data$EVTYPE)
raw.data$EVTYPE <- gsub("^(HIGH WIND).*", "WIND", raw.data$EVTYPE)
raw.data$EVTYPE <- gsub("^(HURRICANE).*", "HURICCANE", raw.data$EVTYPE)
raw.data$EVTYPE <- gsub("^(SNOW).*", "SNOW", raw.data$EVTYPE)
raw.data$EVTYPE <- gsub("^(HEAVY SNOW).*", "SNOW", raw.data$EVTYPE)
raw.data$EVTYPE <- gsub("^(FIRE).*", "FIRE", raw.data$EVTYPE)
raw.data$EVTYPE <- gsub("^(WILD/FOREST FIRE).*", "FIRE", raw.data$EVTYPE)
raw.data$EVTYPE <- gsub("^(WILDFIRE).*", "FIRE", raw.data$EVTYPE)
raw.data$EVTYPE <- gsub("^(WILD FIRES).*", "FIRE", raw.data$EVTYPE)
raw.data$EVTYPE <- gsub("^(HAIL).*", "HAIL", raw.data$EVTYPE)
raw.data$EVTYPE <- gsub("^(BLIZZARD).*", "BLIZZARD", raw.data$EVTYPE)
raw.data$EVTYPE <- gsub("^(COLD).*", "COLD", raw.data$EVTYPE)
raw.data$EVTYPE <- gsub("^(WINTER WEATHER).*", "COLD", raw.data$EVTYPE)
raw.data$EVTYPE <- gsub("^(EXTREME COLD).*", "COLD", raw.data$EVTYPE)
raw.data$EVTYPE <- gsub("^(RIP).*", "RIP", raw.data$EVTYPE)
raw.data$EVTYPE <- gsub("^(FOG).*", "FOG", raw.data$EVTYPE)
raw.data$EVTYPE <- gsub("^(DENSE FOG).*", "FOG", raw.data$EVTYPE)
raw.data$EVTYPE <- gsub("^(AVALANCHE).*", "AVALANCHE", raw.data$EVTYPE)
raw.data$EVTYPE <- gsub("^(AVALANCE).*", "AVALANCHE", raw.data$EVTYPE)
raw.data$EVTYPE <- gsub("^(RAIN).*", "RAIN", raw.data$EVTYPE)
raw.data$EVTYPE <- gsub("^(HEAVY RAIN).*", "RAIN", raw.data$EVTYPE)
raw.data$EVTYPE <- gsub("^(HIGH SURF).*", "SURF", raw.data$EVTYPE)
raw.data$EVTYPE <- gsub("^(HEAVY SURF).*", "SURF", raw.data$EVTYPE)
raw.data$EVTYPE <- gsub("^(SURF).*", "SURF", raw.data$EVTYPE)
length(unique(raw.data$EVTYPE))
## [1] 658

Let’s consider only the required columns we need for analysis:

names(raw.data)
##  [1] "STATE__"    "BGN_DATE"   "BGN_TIME"   "TIME_ZONE"  "COUNTY"    
##  [6] "COUNTYNAME" "STATE"      "EVTYPE"     "BGN_RANGE"  "BGN_AZI"   
## [11] "BGN_LOCATI" "END_DATE"   "END_TIME"   "COUNTY_END" "COUNTYENDN"
## [16] "END_RANGE"  "END_AZI"    "END_LOCATI" "LENGTH"     "WIDTH"     
## [21] "F"          "MAG"        "FATALITIES" "INJURIES"   "PROPDMG"   
## [26] "PROPDMGEXP" "CROPDMG"    "CROPDMGEXP" "WFO"        "STATEOFFIC"
## [31] "ZONENAMES"  "LATITUDE"   "LONGITUDE"  "LATITUDE_E" "LONGITUDE_"
## [36] "REMARKS"    "REFNUM"
data <- raw.data[, c("EVTYPE", "FATALITIES", "INJURIES", "PROPDMG", "PROPDMGEXP", "CROPDMG", "CROPDMGEXP")]
str(data)
## 'data.frame':    902297 obs. of  7 variables:
##  $ EVTYPE    : chr  "TORNADO" "TORNADO" "TORNADO" "TORNADO" ...
##  $ FATALITIES: num  0 0 0 0 0 0 0 0 1 0 ...
##  $ INJURIES  : num  15 0 2 2 2 6 1 0 14 0 ...
##  $ PROPDMG   : num  25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
##  $ PROPDMGEXP: chr  "K" "K" "K" "K" ...
##  $ CROPDMG   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CROPDMGEXP: chr  "" "" "" "" ...

Replace the EMPTY values with ZEROES:

data$FATALITIES[(data$FATALITIES=="")]<-0
data$INJURIES[(data$INJURIES=="")]<-0
data$PROPDMG[(data$PROPDMG=="")]<-0
data$CROPDMG[(data$CROPDMG=="")]<-0

The variables PROPDMGEXP and CROPDMGEXP are the factors of multiplicity for variables PROPDMG and CROPDMG

(propdmg.exp <- unique(data$PROPDMGEXP))
##  [1] "K" "M" ""  "B" "m" "+" "0" "5" "6" "?" "4" "2" "3" "h" "7" "H" "-"
## [18] "1" "8"
(cropdmg.exp <- unique(data$CROPDMGEXP))
## [1] ""  "M" "K" "m" "B" "?" "0" "k" "2"

Instead of two variables (value + factor) we will have one (absolute value):

let’s create two vectors with numeric factor values

propdmg.exp.m<-c(1e3,1e6, 1, 1e9,1e6,  1,  1,1e5,1e6,  1,1e4,1e2,1e3,  1,1e7,1e2,  1, 10,1e8)
cropdmg.exp.m<-c( 1,1e6,1e3,1e6,1e9,1,1,1e3,1e2)

Lets map the values, calculate the absolute numeric values and save them into two new variables: ABSPROPDMG and ABSCROPDMG

library(plyr)
propdmg.temp<-mapvalues(data$PROPDMGEXP, propdmg.exp, propdmg.exp.m)
cropdmg.temp<-mapvalues(data$CROPDMGEXP, cropdmg.exp, cropdmg.exp.m)
library(dplyr)
## 
## Attaching package: 'dplyr'
## 
## The following objects are masked from 'package:plyr':
## 
##     arrange, count, desc, failwith, id, mutate, rename, summarise,
##     summarize
## 
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## 
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
data<-mutate(data, ABSPROPDMG=as.numeric(as.character(propdmg.temp))*data$PROPDMG, ABSCROPDMG=as.numeric(as.character(cropdmg.temp))*data$CROPDMG)
head(data)
##    EVTYPE FATALITIES INJURIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP
## 1 TORNADO          0       15    25.0          K       0           
## 2 TORNADO          0        0     2.5          K       0           
## 3 TORNADO          0        2    25.0          K       0           
## 4 TORNADO          0        2     2.5          K       0           
## 5 TORNADO          0        2     2.5          K       0           
## 6 TORNADO          0        6     2.5          K       0           
##   ABSPROPDMG ABSCROPDMG
## 1      25000          0
## 2       2500          0
## 3      25000          0
## 4       2500          0
## 5       2500          0
## 6       2500          0

Group our dataset based on Event Type (EVTYPE) and calculate the totals for each columns (FATALITIES, INJURIES, ABSPROPDMG, ABSCROPDMG)

data.summary<-data %>% group_by(EVTYPE) %>% summarize(fatalities=sum(FATALITIES), injuries=sum(INJURIES), property.damage=sum(ABSPROPDMG), crop.damage=sum(ABSCROPDMG))

head(data.summary)
## Source: local data frame [6 x 5]
## 
##                  EVTYPE fatalities injuries property.damage crop.damage
##                   (chr)      (dbl)    (dbl)           (dbl)       (dbl)
## 1    HIGH SURF ADVISORY          0        0          200000           0
## 2         COASTAL FLOOD          0        0               0           0
## 3           FLASH FLOOD          0        0           50000           0
## 4             LIGHTNING          0        0               0           0
## 5             TSTM WIND          0        0         8100000           0
## 6       TSTM WIND (G45)          0        0            8000           0

Public Health Impact would include both fatalities and injuries:

data.phi <- data.summary %>% select(EVTYPE, fatalities, injuries) %>% mutate(total.phi=fatalities+injuries) %>%arrange(desc(total.phi))
data.phi[1:5,]
## Source: local data frame [5 x 4]
## 
##          EVTYPE fatalities injuries total.phi
##           (chr)      (dbl)    (dbl)     (dbl)
## 1       TORNADO       5633    91346     96979
## 2          HEAT       3119     9224     12343
## 3 THUNDER STORM        710     9508     10218
## 4         FLOOD       1488     8574     10062
## 5     LIGHTNING        816     5230      6046

From the above, TORNADO is the most harmful weather event across US:

library(tidyr)
library(ggplot2)
public.health<-gather(data.phi[1:5,1:3],Type, Total, fatalities:injuries)
head(public.health)
## Source: local data frame [6 x 3]
## 
##          EVTYPE       Type Total
##           (chr)      (chr) (dbl)
## 1       TORNADO fatalities  5633
## 2          HEAT fatalities  3119
## 3 THUNDER STORM fatalities   710
## 4         FLOOD fatalities  1488
## 5     LIGHTNING fatalities   816
## 6       TORNADO   injuries 91346
g<-ggplot(public.health, aes(x=reorder(EVTYPE, +Total), y=Total, fill=Type)) +
        ggtitle("Total fatalities & injuries") + 
        xlab("Event Type") +      
        geom_bar(stat="identity", alpha=1) + 
        coord_flip()
print(g)

Let’s define the economical impact as property.damage + crop.damage and see which types of events have the greatest economic consequences:

data.econ<-data.summary%>%select(EVTYPE, property.damage,crop.damage)%>%mutate(total.e=property.damage+crop.damage)%>%arrange(desc(total.e))
data.econ[1:5,]
## Source: local data frame [5 x 4]
## 
##        EVTYPE property.damage crop.damage      total.e
##         (chr)           (dbl)       (dbl)        (dbl)
## 1       FLOOD    162072390896  7099131600 169171522496
## 2   HURICCANE     84756180010  5515292800  90271472810
## 3     TORNADO     56947380677   414953270  57362333947
## 4 STORM SURGE     43323536000        5000  43323541000
## 5        HAIL     15977470013  3026094623  19003564636

FLOOD is the event that have the greatest economic consequences Let’s illustrate:

econ<-gather(data.econ[1:5,1:3],Type, Total, crop.damage:property.damage)
q<-ggplot(econ, aes(x=reorder(EVTYPE, +Total), y=Total, fill=Type)) +
        ggtitle("Total Property & Crop Damage") + 
        xlab("Event Type") +      
        geom_bar(stat="identity", alpha=1) + 
        coord_flip()
print(q)

Results

The above analysis shows that the tornadoes are the most harmful events for people’s health in the United States. This report also indicates that the flood events cause the most economic consequences in United States.