Synopsis

The basic goal of this assignment is to explore the NOAA Storm Database and answer some basic questions about severe weather events:

1. Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?

2. Across the United States, which types of events have the greatest economic consequences?

Main steps:

  1. Read the compressed (bz2) file and keep only the relevant columns (event type, casualties, and damages)

  2. Create two new columns containing multipliers to calculate total amounts in dollars

  3. Calculate the sum of damages by event type

  4. Order the data and keep only the top 10 event types for each kind of damage

  5. Draw barplots to present the results

Database description

This database tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage.

A data frame with 902297 observations on 37 variables.

Important variables:

EVTYPE - event type (TORNADO, TSTM WIND, HAIL, FREEZING RAIN, …)

FATALITIES - number of people killed

INJURIES - number of people injured

PROPDMG - amount of property damage (measured in money)

PROPDMGEXP - unit (magnitude code) of the property damage amount (B,M,K,H,…)

CROPDMG - amount of crop damage (measured in money)

CROPDMGEXP - unit (magnitude code) of the crop damage amount (B,M,K,H,…)

library(plyr)
library(dplyr)
## 
## Attaching package: 'dplyr'
## 
## The following objects are masked from 'package:plyr':
## 
##     arrange, count, desc, failwith, id, mutate, rename, summarise,
##     summarize
## 
## The following object is masked from 'package:stats':
## 
##     filter
## 
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(doBy)
## Loading required package: survival
library("knitr")

Data processing

Loading and preprocessing the data

allStormData=read.csv("~/Downloads/repdata-data-StormData.csv.bz2")
dim(allStormData)
## [1] 902297     37
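The call above passes the `.bz2` path straight to `read.csv()`, with no separate decompression step. The following minimal sketch illustrates that behaviour on a toy file; the file name, the `toy` data frame and its two rows are made up for illustration and are not part of the storm data.

```r
# Write a tiny CSV through a bzip2 connection (toy data, hypothetical names).
tmp <- tempfile(fileext = ".csv.bz2")
toy <- data.frame(EVTYPE = c("TORNADO", "HAIL"),
                  FATALITIES = c(0, 1))
con <- bzfile(tmp, open = "wt")
write.csv(toy, con, row.names = FALSE)
close(con)

# read.csv() opens the path with file(), which detects the bzip2
# compression automatically, so the compressed file is read directly.
toy_back <- read.csv(tmp, stringsAsFactors = FALSE)
dim(toy_back)
```

Reading the compressed file directly avoids keeping an uncompressed copy of a large data set on disk.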

Select the variables from the dataset that will be used later in the analysis.

stormData = allStormData[, c("EVTYPE", "FATALITIES", "INJURIES", "PROPDMG", "PROPDMGEXP", "CROPDMG", "CROPDMGEXP")]
summary(stormData)
##                EVTYPE         FATALITIES          INJURIES        
##  HAIL             :288661   Min.   :  0.0000   Min.   :   0.0000  
##  TSTM WIND        :219940   1st Qu.:  0.0000   1st Qu.:   0.0000  
##  THUNDERSTORM WIND: 82563   Median :  0.0000   Median :   0.0000  
##  TORNADO          : 60652   Mean   :  0.0168   Mean   :   0.1557  
##  FLASH FLOOD      : 54277   3rd Qu.:  0.0000   3rd Qu.:   0.0000  
##  FLOOD            : 25326   Max.   :583.0000   Max.   :1700.0000  
##  (Other)          :170878                                         
##     PROPDMG          PROPDMGEXP        CROPDMG          CROPDMGEXP    
##  Min.   :   0.00          :465934   Min.   :  0.000          :618413  
##  1st Qu.:   0.00   K      :424665   1st Qu.:  0.000   K      :281832  
##  Median :   0.00   M      : 11330   Median :  0.000   M      :  1994  
##  Mean   :  12.06   0      :   216   Mean   :  1.527   k      :    21  
##  3rd Qu.:   0.50   B      :    40   3rd Qu.:  0.000   0      :    19  
##  Max.   :5000.00   5      :    28   Max.   :990.000   B      :     9  
##                    (Other):    84                     (Other):     9
# Pandoc tables
kable(head(stormData), format = "pandoc")
EVTYPE     FATALITIES   INJURIES   PROPDMG  PROPDMGEXP    CROPDMG  CROPDMGEXP
--------  -----------  ---------  --------  -----------  --------  -----------
TORNADO             0         15      25.0  K                   0
TORNADO             0          0       2.5  K                   0
TORNADO             0          2      25.0  K                   0
TORNADO             0          2       2.5  K                   0
TORNADO             0          2       2.5  K                   0
TORNADO             0          6       2.5  K                   0

Question 1. Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?

Select the relevant data: keep only the records with at least one fatality or injury. These are used to determine which event types are most harmful with respect to population health.

sub<-subset(stormData,
           select=c("EVTYPE", "FATALITIES", "INJURIES"), 
           ((!is.na(FATALITIES)) & (!is.na(INJURIES)) &
           ((FATALITIES > 0) |  (INJURIES > 0)) ))
summary(sub)
##                EVTYPE       FATALITIES          INJURIES       
##  TORNADO          :7928   Min.   :  0.0000   Min.   :   0.000  
##  LIGHTNING        :3305   1st Qu.:  0.0000   1st Qu.:   1.000  
##  TSTM WIND        :2930   Median :  0.0000   Median :   1.000  
##  FLASH FLOOD      : 931   Mean   :  0.6906   Mean   :   6.408  
##  THUNDERSTORM WIND: 682   3rd Qu.:  1.0000   3rd Qu.:   3.000  
##  EXCESSIVE HEAT   : 678   Max.   :583.0000   Max.   :1700.000  
##  (Other)          :5475
# Group data by the variable "EVTYPE" and sum fatalities and injuries for each event type
sub=summaryBy(FATALITIES+INJURIES~EVTYPE, data=sub, FUN=sum) 

# Sort by FATALITIES.sum + INJURIES.sum in descending order using arrange()
sub = arrange(sub, desc(FATALITIES.sum + INJURIES.sum))

#top 10 events which are most harmful
health=sub[1:10,]
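The group, sort, and truncate steps above can also be sketched with base R only. This is an alternative illustration on a made-up toy data frame, not the code used for the results in this report: it swaps `doBy::summaryBy()` and `dplyr::arrange()` for `aggregate()` and `order()`, and the aggregated columns keep their original names (no `.sum` suffix).

```r
# Toy records (hypothetical values) with the same column names as stormData.
toy <- data.frame(
  EVTYPE     = c("TORNADO", "HAIL", "TORNADO", "FLOOD"),
  FATALITIES = c(1, 0, 2, 1),
  INJURIES   = c(10, 1, 5, 0)
)

# Sum fatalities and injuries per event type.
totals <- aggregate(cbind(FATALITIES, INJURIES) ~ EVTYPE, data = toy, FUN = sum)

# Sort by combined casualties, largest first, and keep the top rows.
totals <- totals[order(-(totals$FATALITIES + totals$INJURIES)), ]
head(totals, 2)
```

In this toy example TORNADO comes first, with 3 fatalities and 15 injuries.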

Question 2. Across the United States, which types of events have the greatest economic consequences?

Select the relevant data: keep only the records with non-zero property or crop damage. These are used to determine which event types have the greatest economic consequences.

sub=subset(stormData,
           select=c("EVTYPE","PROPDMG", "PROPDMGEXP", "CROPDMG", "CROPDMGEXP"),
           (!is.na(PROPDMG)) & (!is.na(CROPDMG)) &
           ((PROPDMG > 0) | (CROPDMG > 0)))

Replace the codes in CROPDMGEXP and PROPDMGEXP with numeric multipliers according to these rules:

B -> 1,000,000,000

M,m -> 1,000,000

K,k -> 1,000

H,h -> 100

+,-,?, blank, 0 -> 0

1-8 -> 10
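As a compact illustration of these rules, the sketch below wraps them in a single vectorised helper. The function name `exp_to_multiplier` is hypothetical and is not part of the original analysis. Digits 1-8 are mapped to 10, following the `1e1` values actually passed to `mapvalues()` in this report's code, and every other code (blank, `+`, `-`, `?`, `0`) falls back to 0.

```r
# Hypothetical helper: translate unit codes into dollar multipliers.
exp_to_multiplier <- function(code) {
  code <- toupper(as.character(code))          # treat "k"/"K", "m"/"M", "h"/"H" alike
  lookup <- c(B = 1e9, M = 1e6, K = 1e3, H = 1e2)
  out <- rep(0, length(code))                  # default for "", "+", "-", "?", "0"
  is_letter <- code %in% names(lookup)
  out[is_letter] <- lookup[code[is_letter]]
  out[code %in% as.character(1:8)] <- 10       # digits 1-8 -> 10
  out
}

# A damage amount of 25.0 with unit "K" corresponds to 25,000 dollars.
25.0 * exp_to_multiplier("K")

exp_to_multiplier(c("K", "m", "B", "h", "", "?", "5", "0"))
```

Because the fallback multiplier is 0, records with an unrecognised unit code contribute nothing to the totals.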

unique(sub$CROPDMGEXP)
## [1]   M K m B ? 0 k
## Levels:  ? 0 2 B k K m M
sub$CROPDMGEXP[sub$CROPDMGEXP == "?" | sub$CROPDMGEXP == ""]="0"
 
CROPDMGDol= mapvalues(sub$CROPDMGEXP,
                from=c("M","K","m","B","k","0"),
                to=c(1e6,1e3,1e6,1e9,1e3,0e0))

unique(sub$PROPDMGEXP)
##  [1] K M B m   + 0 5 6 4 h 2 7 3 H -
## Levels:  - ? + 0 1 2 3 4 5 6 7 8 B h H K m M
sub$PROPDMGEXP[sub$PROPDMGEXP == "+" | 
            sub$PROPDMGEXP == "" |
            sub$PROPDMGEXP == "-"]="0"


PROPDMGDol = mapvalues(sub$PROPDMGEXP,
                from=c("K","M", "B","m","5","6","4","2","3","h","7","H","1","8","0"), 
                # Alternative (not used): read each digit d as 10^d
                #to=c(1e3,1e6,1e9,1e6, 1e5,1e6,1e4,1e2,1e3,1e2,1e7,1e2, 1e1,1e8,0e0))
                # Used here: every digit 1-8 maps to 10
                to=c(1e3,1e6,1e9,1e6, 1e1,1e1,1e1,1e1,1e1,1e2,1e1,1e2, 1e1,1e1,0e0))

# Create new columns with the dollar amounts of property and crop damage
sub$PROP=as.numeric(as.vector(PROPDMGDol))*sub$PROPDMG
sub$CROP=as.numeric(as.vector(CROPDMGDol))*sub$CROPDMG


# Group data by the variable "EVTYPE" and sum crop and property damage for each event type
sub=summaryBy(CROP + PROP ~ EVTYPE, data=sub, FUN=sum) 

# Sort by PROP.sum + CROP.sum in descending order using arrange()
sub <- arrange(sub, desc(PROP.sum + CROP.sum))

#top 10 events which have the greatest economic consequences
economic=sub[1:10,]

Results

Question 1

health 
##               EVTYPE FATALITIES.sum INJURIES.sum
## 1            TORNADO           5633        91346
## 2     EXCESSIVE HEAT           1903         6525
## 3          TSTM WIND            504         6957
## 4              FLOOD            470         6789
## 5          LIGHTNING            816         5230
## 6               HEAT            937         2100
## 7        FLASH FLOOD            978         1777
## 8          ICE STORM             89         1975
## 9  THUNDERSTORM WIND            133         1488
## 10      WINTER STORM            206         1321
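To see the ordering criterion explicitly, the sketch below recomputes the combined casualty count from the values printed in the table above. The `TOTAL` and `SHARE` columns are hypothetical helpers added for illustration; they do not exist in the original `health` object.

```r
# Values copied from the `health` table printed above.
health_top <- data.frame(
  EVTYPE = c("TORNADO", "EXCESSIVE HEAT", "TSTM WIND", "FLOOD", "LIGHTNING",
             "HEAT", "FLASH FLOOD", "ICE STORM", "THUNDERSTORM WIND",
             "WINTER STORM"),
  FATALITIES.sum = c(5633, 1903, 504, 470, 816, 937, 978, 89, 133, 206),
  INJURIES.sum   = c(91346, 6525, 6957, 6789, 5230, 2100, 1777, 1975, 1488, 1321)
)

# Combined casualties: the quantity the table is sorted by.
health_top$TOTAL <- health_top$FATALITIES.sum + health_top$INJURIES.sum

# Share of the top-10 casualties attributable to each event type.
health_top$SHARE <- round(health_top$TOTAL / sum(health_top$TOTAL), 3)

health_top[, c("EVTYPE", "TOTAL", "SHARE")]
```

Among these ten event types, tornadoes account for roughly 71% of the combined fatalities and injuries.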

Draw barplots

par(mfrow = c(1, 2))
barplot(health$FATALITIES.sum, 
        names.arg = health$EVTYPE, 
        main = "Fatalities", 
        ylab = "fatalities", 
        cex.axis = 0.7, col="blue",
        cex.names = 0.7, las = 2)
barplot(health$INJURIES.sum, names.arg = health$EVTYPE, main = "Injuries", 
        ylab = "injuries", 
        cex.axis = 0.7, col="red",
        cex.names = 0.7, las = 2)

Question 2

economic 
##               EVTYPE    CROP.sum     PROP.sum
## 1              FLOOD  5661968450 144657709800
## 2  HURRICANE/TYPHOON  2607872800  69305840000
## 3            TORNADO   414953110  56937161502
## 4        STORM SURGE        5000  43323536000
## 5               HAIL  3025954450  15732267520
## 6        FLASH FLOOD  1421317100  16140812396
## 7            DROUGHT 13972566000   1046106000
## 8          HURRICANE  2741910000  11868319010
## 9        RIVER FLOOD  5029459000   5118945500
## 10         ICE STORM  5022113500   3944927810

Draw barplots

barplot(economic$PROP.sum, 
        names.arg = economic$EVTYPE, 
        main = "Property and crop damage",
        ylab = "Value of economic damages",
        cex.axis = 0.7,cex.names = 0.7,
        las = 2,  col="red")
# Overlay crop damage (blue) on top of the property-damage bars (red)
barplot(economic$CROP.sum,
        cex.axis = 0.7,cex.names = 0.7,
        las = 2,col="blue",add=TRUE)
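In the overlaid plot, a crop bar that is taller than the property bar (as for DROUGHT) covers the property bar completely. One possible alternative, shown here only as a sketch, is to draw the two series side by side from a matrix. The `econ` data frame below copies the first three rows of the `economic` table; the object names and the scaling to billions of dollars are illustrative choices.

```r
# Values copied from the first three rows of the `economic` table above.
econ <- data.frame(
  EVTYPE   = c("FLOOD", "HURRICANE/TYPHOON", "TORNADO"),
  CROP.sum = c(5661968450, 2607872800, 414953110),
  PROP.sum = c(144657709800, 69305840000, 56937161502)
)

# barplot() draws grouped bars from a matrix: one row per series,
# one column per event type. Amounts are scaled to billions of dollars.
damage <- rbind(Property = econ$PROP.sum, Crop = econ$CROP.sum) / 1e9
colnames(damage) <- econ$EVTYPE

barplot(damage, beside = TRUE, col = c("red", "blue"),
        legend.text = TRUE,            # legend labels come from the row names
        las = 2, cex.names = 0.7, cex.axis = 0.7,
        main = "Property and crop damage",
        ylab = "Damage (billions of dollars)")
```

With `beside = TRUE` both series stay visible for every event type, at the cost of a wider plot.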

Conclusion

1. Tornadoes (event type “TORNADO”) are the most harmful with respect to population health.

2. Floods (event type “FLOOD”) have the greatest economic consequences.