Reproducible Research: Peer Assessment 2

Synopsis

This research involves exploring the U.S. National Oceanic and Atmospheric Administration's (NOAA) storm database. This database tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage.

Across the United States, the types of storm events that are most harmful with respect to population health are tornadoes, heat, and floods, measured by total fatalities and injuries.

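Across the United States, the types of storm events that have the greatest economic consequences are tornadoes, hurricanes, and floods, measured by combined property and crop damage among events that caused at least one fatality or injury.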

Data Processing

The R statistical computing and graphics environment was used to analyze the data and publish this report. This assignment requires the following R libraries.

library("xtable");
library("plyr");
library("ggplot2");

Loading the data

The following R code was used to read the compressed data archive and load the storm data into R.

tmp <- "data/repdata-data-StormData.csv.bz2";
# read.csv() decompresses .bz2 files transparently, so no explicit
# bzfile() connection has to be opened (or left unclosed).
storm <- read.csv(tmp,stringsAsFactors=FALSE,header=TRUE);
str(storm[,c('BGN_DATE','FATALITIES','INJURIES','PROPDMG','CROPDMG')]);
## 'data.frame':    902297 obs. of  5 variables:
##  $ BGN_DATE  : chr  "4/18/1950 0:00:00" "4/18/1950 0:00:00" "2/20/1951 0:00:00" "6/8/1951 0:00:00" ...
##  $ FATALITIES: num  0 0 0 0 0 0 0 0 1 0 ...
##  $ INJURIES  : num  15 0 2 2 2 6 1 0 14 0 ...
##  $ PROPDMG   : num  25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
##  $ CROPDMG   : num  0 0 0 0 0 0 0 0 0 0 ...
summary(storm[,c('BGN_DATE','FATALITIES','INJURIES','PROPDMG','CROPDMG')]);
## Warning: closing unused connection 5 (data/repdata-data-StormData.csv.bz2)
##    BGN_DATE           FATALITIES     INJURIES         PROPDMG    
##  Length:902297      Min.   :  0   Min.   :   0.0   Min.   :   0  
##  Class :character   1st Qu.:  0   1st Qu.:   0.0   1st Qu.:   0  
##  Mode  :character   Median :  0   Median :   0.0   Median :   0  
##                     Mean   :  0   Mean   :   0.2   Mean   :  12  
##                     3rd Qu.:  0   3rd Qu.:   0.0   3rd Qu.:   0  
##                     Max.   :583   Max.   :1700.0   Max.   :5000  
##     CROPDMG     
##  Min.   :  0.0  
##  1st Qu.:  0.0  
##  Median :  0.0  
##  Mean   :  1.5  
##  3rd Qu.:  0.0  
##  Max.   :990.0
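BGN_DATE is read as a character column with a trailing time component. As a minimal sketch, using two sample values copied from the str() output above, as.Date() with a month/day/year format reads only as far as the format requires and ignores the trailing time:

```r
# Sample values copied from the str() output above.
raw <- c("4/18/1950 0:00:00", "6/8/1951 0:00:00");
# as.Date() stops once the format is satisfied, so the time part is dropped.
parsed <- as.Date(raw, format="%m/%d/%Y");
print(parsed);
```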

Preprocessing the data

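The following R code was used to take a subset of the data. The subset keeps only events that caused at least one fatality or injury, and only the variables of interest, which relate to population health (fatalities and injuries) and economic consequences (property damage and crop damage). Damage values are scaled by their exponent codes (H, K, M, B), and a few closely related event types are merged.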

varNames <- c('BGN_DATE',  'EVTYPE', 'FATALITIES','INJURIES',
              'PROPDMGEXP','PROPDMG','CROPDMGEXP','CROPDMG');
df <- storm[storm$FATALITIES>0 | storm$INJURIES>0,varNames];  
df$BGN_DATE   <- as.Date(df$BGN_DATE,"%m/%d/%Y");
df$EVTYPE     <- toupper(df$EVTYPE);
df$PROPDMGEXP <- factor(toupper(df$PROPDMGEXP));
df$PROPDMG[df$PROPDMGEXP=='H'] <- df$PROPDMG[df$PROPDMGEXP=='H'] * 100;
df$PROPDMG[df$PROPDMGEXP=='K'] <- df$PROPDMG[df$PROPDMGEXP=='K'] * 1000;
df$PROPDMG[df$PROPDMGEXP=='M'] <- df$PROPDMG[df$PROPDMGEXP=='M'] * 1000000;
df$PROPDMG[df$PROPDMGEXP=='B'] <- df$PROPDMG[df$PROPDMGEXP=='B'] * 1000000000;
df$CROPDMGEXP <- factor(toupper(df$CROPDMGEXP));
df$CROPDMG[df$CROPDMGEXP=='H'] <- df$CROPDMG[df$CROPDMGEXP=='H'] * 100;
df$CROPDMG[df$CROPDMGEXP=='K'] <- df$CROPDMG[df$CROPDMGEXP=='K'] * 1000;
df$CROPDMG[df$CROPDMGEXP=='M'] <- df$CROPDMG[df$CROPDMGEXP=='M'] * 1000000;
df$CROPDMG[df$CROPDMGEXP=='B'] <- df$CROPDMG[df$CROPDMGEXP=='B'] * 1000000000;
df$ECONOMIC   <- df$PROPDMG + df$CROPDMG;
df$HEALTH     <- df$FATALITIES + df$INJURIES;
# Merge related event types (vectorized; same mappings as a row-by-row loop).
# 'TSTM WIND' is NOAA's abbreviation for thunderstorm wind; it is grouped
# under 'TROPICAL STORM' here.
df$EVTYPE[df$EVTYPE=='FLASH FLOOD']       <- 'FLOOD';
df$EVTYPE[df$EVTYPE=='TSTM WIND']         <- 'TROPICAL STORM';
df$EVTYPE[df$EVTYPE=='EXCESSIVE HEAT']    <- 'HEAT';
df$EVTYPE[df$EVTYPE=='HURRICANE/TYPHOON'] <- 'HURRICANE';
varNames <- c('BGN_DATE','EVTYPE', 'FATALITIES','INJURIES',
              'HEALTH',  'PROPDMG','CROPDMG',   'ECONOMIC');
df <- df[,varNames]
str(df); summary(df);
## 'data.frame':    21929 obs. of  8 variables:
##  $ BGN_DATE  : Date, format: "1950-04-18" "1951-02-20" ...
##  $ EVTYPE    : chr  "TORNADO" "TORNADO" "TORNADO" "TORNADO" ...
##  $ FATALITIES: num  0 0 0 0 0 0 1 0 0 1 ...
##  $ INJURIES  : num  15 2 2 2 6 1 14 3 3 26 ...
##  $ HEALTH    : num  15 2 2 2 6 1 15 3 3 27 ...
##  $ PROPDMG   : num  25000 25000 2500 2500 2500 2500 25000 2500000 2500000 250000 ...
##  $ CROPDMG   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ ECONOMIC  : num  25000 25000 2500 2500 2500 2500 25000 2500000 2500000 250000 ...
##     BGN_DATE             EVTYPE            FATALITIES       INJURIES     
##  Min.   :1950-01-03   Length:21929       Min.   :  0.0   Min.   :   0.0  
##  1st Qu.:1987-07-06   Class :character   1st Qu.:  0.0   1st Qu.:   1.0  
##  Median :1997-12-08   Mode  :character   Median :  0.0   Median :   1.0  
##  Mean   :1993-07-24                      Mean   :  0.7   Mean   :   6.4  
##  3rd Qu.:2004-05-02                      3rd Qu.:  1.0   3rd Qu.:   3.0  
##  Max.   :2011-11-30                      Max.   :583.0   Max.   :1700.0  
##      HEALTH          PROPDMG            CROPDMG            ECONOMIC       
##  Min.   :   1.0   Min.   :0.00e+00   Min.   :0.00e+00   Min.   :0.00e+00  
##  1st Qu.:   1.0   1st Qu.:0.00e+00   1st Qu.:0.00e+00   1st Qu.:0.00e+00  
##  Median :   2.0   Median :1.00e+04   Median :0.00e+00   Median :1.00e+04  
##  Mean   :   7.1   Mean   :5.68e+06   Mean   :2.76e+05   Mean   :5.96e+06  
##  3rd Qu.:   4.0   3rd Qu.:2.50e+05   3rd Qu.:0.00e+00   3rd Qu.:2.50e+05  
##  Max.   :1742.0   Max.   :1.00e+10   Max.   :1.51e+09   Max.   :1.00e+10
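The exponent scaling above can also be written with a lookup table instead of one assignment per code. The following is only a sketch on a made-up data frame (the names toy, DMG, DMGEXP, DMG_USD and mult are hypothetical, and the values are not taken from the storm data). As in the code above, any exponent code other than H, K, M or B leaves the value unscaled.

```r
# Toy data frame; values are illustrative, not taken from the storm data.
toy <- data.frame(
    DMG    = c(25, 2.5, 1, 3, 7),
    DMGEXP = c("K", "m", "B", "", "?"),
    stringsAsFactors = FALSE
);
# Named lookup table: exponent code -> multiplier.
mult <- c(H=1e2, K=1e3, M=1e6, B=1e9);
# Look up each code; codes missing from the table yield NA,
# which is replaced by 1 so the value stays unscaled.
m <- unname(mult[toupper(toy$DMGEXP)]);
m[is.na(m)] <- 1;
toy$DMG_USD <- toy$DMG * m;
print(toy);
```

Keeping the multipliers in one named vector makes it easier to see, and to change, how each code is interpreted.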

Results

The following R code ranks storm events by the number of fatalities and by the number of injuries that they caused.

head(arrange(as.data.frame(xtabs(FATALITIES~EVTYPE,df)),desc(Freq)));
##           EVTYPE Freq
## 1        TORNADO 5633
## 2           HEAT 2840
## 3          FLOOD 1448
## 4      LIGHTNING  816
## 5 TROPICAL STORM  562
## 6    RIP CURRENT  368
head(arrange(as.data.frame(xtabs(INJURIES~EVTYPE,df)),desc(Freq)));
##           EVTYPE  Freq
## 1        TORNADO 91346
## 2           HEAT  8625
## 3          FLOOD  8566
## 4 TROPICAL STORM  7297
## 5      LIGHTNING  5230
## 6      ICE STORM  1975
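The ranking step sums one variable per event type and sorts the totals in decreasing order. A base R sketch of the same idea on a made-up data frame (toy, totals and ranked are hypothetical names, and the numbers are illustrative only) might look like this:

```r
# Toy data; values are illustrative, not taken from the storm data.
toy <- data.frame(
    EVTYPE     = c("TORNADO", "HEAT", "TORNADO", "FLOOD", "HEAT", "TORNADO"),
    FATALITIES = c(3, 5, 1, 2, 0, 4),
    stringsAsFactors = FALSE
);
# Sum fatalities within each event type.
totals <- aggregate(FATALITIES ~ EVTYPE, data=toy, FUN=sum);
# Sort by total, largest first, and renumber the rows.
ranked <- totals[order(-totals$FATALITIES), ];
rownames(ranked) <- NULL;
print(ranked);
```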

Now that we know the top 6 storm events in terms of harm to the population, a bar chart can be generated.

tmp <- head(arrange(as.data.frame(xtabs(HEALTH~EVTYPE,df)),desc(Freq)));
# Keep only rows whose event type is among the top 6.
ck <- df$EVTYPE %in% as.character(tmp$EVTYPE);
# weight=HEALTH makes each bar the total of fatalities and injuries,
# rather than the number of recorded events.
qplot(EVTYPE,data=df[ck,c('EVTYPE','HEALTH')],
      weight=HEALTH,geom='bar',
      main ='Top Harmful Storm Events in the U.S. (1950-2011)',
      xlab ='Storm Event Type',
      ylab ='Number of Fatalities or Injuries');

plot of chunk Results2

The following R code ranks storm events by the value of the crop damage and of the property damage that they caused.

head(arrange(as.data.frame(xtabs(CROPDMG~EVTYPE,df)),desc(Freq)));
##           EVTYPE      Freq
## 1      HURRICANE 3.680e+09
## 2           HEAT 4.937e+08
## 3      HIGH WIND 3.520e+08
## 4 TROPICAL STORM 2.225e+08
## 5       WILDFIRE 1.821e+08
## 6          FLOOD 1.756e+08
head(arrange(as.data.frame(xtabs(PROPDMG~EVTYPE,df)),desc(Freq)));
##             EVTYPE      Freq
## 1          TORNADO 4.189e+10
## 2        HURRICANE 3.556e+10
## 3            FLOOD 1.053e+10
## 4   TROPICAL STORM 7.954e+09
## 5     WINTER STORM 5.221e+09
## 6 STORM SURGE/TIDE 4.004e+09

Now that we know the top 6 storm events in terms of economic impact, a bar chart can be generated.

tmp <- head(arrange(as.data.frame(xtabs(ECONOMIC~EVTYPE,df)),desc(Freq)))
# Keep only rows whose event type is among the top 6.
ck <- df$EVTYPE %in% as.character(tmp$EVTYPE);
# Plot the ECONOMIC column (not HEALTH); weight=ECONOMIC makes each bar
# the total dollar value rather than the number of recorded events.
qplot(EVTYPE,data=df[ck,c('EVTYPE','ECONOMIC')],
      weight=ECONOMIC,geom='bar',
      main ='Top Economically Impactful Storm Events in the U.S. (1950-2011)',
      xlab ='Storm Event Type',
      ylab ='Dollar Value')

plot of chunk Results4
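
A bar chart of event types can show either how many events of each type were recorded or the total harm they caused, and the two can rank event types differently. The sketch below uses made-up numbers (toy, n_events and total are hypothetical names) to show the difference between counting rows and summing a variable per event type:

```r
# Toy data; values are illustrative, not taken from the storm data.
toy <- data.frame(
    EVTYPE = c("TORNADO", "TORNADO", "FLOOD", "FLOOD", "FLOOD"),
    HEALTH = c(40, 10, 1, 2, 3),
    stringsAsFactors = FALSE
);
# Number of recorded events per type (what an unweighted bar chart shows).
n_events <- table(toy$EVTYPE);
# Total fatalities and injuries per type (what a weighted bar chart shows).
total <- tapply(toy$HEALTH, toy$EVTYPE, sum);
print(n_events);
print(total);
```

In this toy example FLOOD has more recorded events, while TORNADO accounts for more total harm.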