NOAA Storm Weather Data Analysis

This is an analysis of the NOAA storm data. The data for this analysis came as a comma-separated-value file compressed with bzip2, available at https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2.

Synopsis

Severe weather causes human casualties, property damage, and crop damage, and disrupts lives. NOAA's storm database records such events from 1950 to 2011. Based on the data, tornadoes (5633), excessive heat (1903), flash floods (978), heat (937), and lightning (816) cause the most fatalities, while tornadoes (91346), thunderstorm winds (6957), floods (6789), excessive heat (6525), and lightning (5230) cause the most injuries. When we look at the combined property and crop damage instead, floods ($150 billion), hurricanes ($71B), tornadoes ($57B), storm surges ($43B), and hail ($18B) have the greatest cost impact.

Data Processing

Let us clear the workspace by removing all objects from the environment

rm(list=ls()) # Remove everything from environment
cat("\014")   # Clear Console

# Load the necessary graphics packages
library(ggplot2);
#Load the dplyr package
library(dplyr);
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
setwd("/Users/rdoraiswamy/mygit/Reproducible_Research_Project/");

Let us get the file from the Web

download.file("https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2", "repdata-data-StormData.csv.bz2");
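
The download above runs every time the report is rebuilt. A minimal sketch of a guard that skips the download when the file is already on disk; `fetch_if_missing` is a hypothetical helper name introduced here for illustration, not part of the original analysis, and the URL in the demonstration is a placeholder that is never contacted.

```r
# Hypothetical helper: download `url` to `dest` only when `dest` is absent.
fetch_if_missing <- function(url, dest) {
  if (!file.exists(dest)) {
    # mode = "wb" keeps the compressed file byte-exact on Windows
    download.file(url, dest, mode = "wb")
  }
  invisible(dest)
}

# Demonstration with a file that already exists, so nothing is downloaded.
existing <- tempfile(fileext = ".bz2")
writeLines("placeholder", existing)
result <- fetch_if_missing("https://example.invalid/unused", existing)
```

In the report itself the call would take the same URL and file name that are passed to download.file above.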

Check the file size

file.size ("repdata-data-StormData.csv.bz2");
## [1] 49177144
# Read the CSV file into a data frame (R decompresses the .bz2 file transparently)
act <- read.table("repdata-data-StormData.csv.bz2", sep = ",", header = TRUE);
# Let's review the first rows and the structure of the data
head(act);
##   STATE__           BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME STATE
## 1       1  4/18/1950 0:00:00     0130       CST     97     MOBILE    AL
## 2       1  4/18/1950 0:00:00     0145       CST      3    BALDWIN    AL
## 3       1  2/20/1951 0:00:00     1600       CST     57    FAYETTE    AL
## 4       1   6/8/1951 0:00:00     0900       CST     89    MADISON    AL
## 5       1 11/15/1951 0:00:00     1500       CST     43    CULLMAN    AL
## 6       1 11/15/1951 0:00:00     2000       CST     77 LAUDERDALE    AL
##    EVTYPE BGN_RANGE BGN_AZI BGN_LOCATI END_DATE END_TIME COUNTY_END
## 1 TORNADO         0                                               0
## 2 TORNADO         0                                               0
## 3 TORNADO         0                                               0
## 4 TORNADO         0                                               0
## 5 TORNADO         0                                               0
## 6 TORNADO         0                                               0
##   COUNTYENDN END_RANGE END_AZI END_LOCATI LENGTH WIDTH F MAG FATALITIES
## 1         NA         0                      14.0   100 3   0          0
## 2         NA         0                       2.0   150 2   0          0
## 3         NA         0                       0.1   123 2   0          0
## 4         NA         0                       0.0   100 2   0          0
## 5         NA         0                       0.0   150 2   0          0
## 6         NA         0                       1.5   177 2   0          0
##   INJURIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP WFO STATEOFFIC ZONENAMES
## 1       15    25.0          K       0                                    
## 2        0     2.5          K       0                                    
## 3        2    25.0          K       0                                    
## 4        2     2.5          K       0                                    
## 5        2     2.5          K       0                                    
## 6        6     2.5          K       0                                    
##   LATITUDE LONGITUDE LATITUDE_E LONGITUDE_ REMARKS REFNUM
## 1     3040      8812       3051       8806              1
## 2     3042      8755          0          0              2
## 3     3340      8742          0          0              3
## 4     3458      8626          0          0              4
## 5     3412      8642          0          0              5
## 6     3450      8748          0          0              6
str(act);
## 'data.frame':    902297 obs. of  37 variables:
##  $ STATE__   : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ BGN_DATE  : Factor w/ 16335 levels "1/1/1966 0:00:00",..: 6523 6523 4242 11116 2224 2224 2260 383 3980 3980 ...
##  $ BGN_TIME  : Factor w/ 3608 levels "00:00:00 AM",..: 272 287 2705 1683 2584 3186 242 1683 3186 3186 ...
##  $ TIME_ZONE : Factor w/ 22 levels "ADT","AKS","AST",..: 7 7 7 7 7 7 7 7 7 7 ...
##  $ COUNTY    : num  97 3 57 89 43 77 9 123 125 57 ...
##  $ COUNTYNAME: Factor w/ 29601 levels "","5NM E OF MACKINAC BRIDGE TO PRESQUE ISLE LT MI",..: 13513 1873 4598 10592 4372 10094 1973 23873 24418 4598 ...
##  $ STATE     : Factor w/ 72 levels "AK","AL","AM",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ EVTYPE    : Factor w/ 985 levels "   HIGH SURF ADVISORY",..: 834 834 834 834 834 834 834 834 834 834 ...
##  $ BGN_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ BGN_AZI   : Factor w/ 35 levels "","  N"," NW",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ BGN_LOCATI: Factor w/ 54429 levels ""," Christiansburg",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ END_DATE  : Factor w/ 6663 levels "","1/1/1993 0:00:00",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ END_TIME  : Factor w/ 3647 levels ""," 0900CST",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ COUNTY_END: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ COUNTYENDN: logi  NA NA NA NA NA NA ...
##  $ END_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ END_AZI   : Factor w/ 24 levels "","E","ENE","ESE",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ END_LOCATI: Factor w/ 34506 levels ""," CANTON"," TULIA",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ LENGTH    : num  14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
##  $ WIDTH     : num  100 150 123 100 150 177 33 33 100 100 ...
##  $ F         : int  3 2 2 2 2 2 2 1 3 3 ...
##  $ MAG       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ FATALITIES: num  0 0 0 0 0 0 0 0 1 0 ...
##  $ INJURIES  : num  15 0 2 2 2 6 1 0 14 0 ...
##  $ PROPDMG   : num  25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
##  $ PROPDMGEXP: Factor w/ 19 levels "","-","?","+",..: 17 17 17 17 17 17 17 17 17 17 ...
##  $ CROPDMG   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CROPDMGEXP: Factor w/ 9 levels "","?","0","2",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ WFO       : Factor w/ 542 levels ""," CI","%SD",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ STATEOFFIC: Factor w/ 250 levels "","ALABAMA, Central",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ ZONENAMES : Factor w/ 25112 levels "","                                                                                                                               "| __truncated__,..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ LATITUDE  : num  3040 3042 3340 3458 3412 ...
##  $ LONGITUDE : num  8812 8755 8742 8626 8642 ...
##  $ LATITUDE_E: num  3051 0 0 0 0 ...
##  $ LONGITUDE_: num  8806 0 0 0 0 ...
##  $ REMARKS   : Factor w/ 436781 levels "","\t","\t\t",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ REFNUM    : num  1 2 3 4 5 6 7 8 9 10 ...

Since the Event types have mixed case, let us convert them to upper case

act$EVTYPE <- toupper(act$EVTYPE);
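
The str output above shows that some event labels also carry leading spaces (for example "   HIGH SURF ADVISORY"), and upper-casing alone leaves those as separate categories. A minimal sketch on invented labels of how trimming could be combined with the case conversion. This step is not part of the original analysis, and applying it to act$EVTYPE would merge some categories and so change the group counts reported below.

```r
# Toy event labels, invented for illustration (not read from the data set)
ev <- c("   High Surf Advisory", " COASTAL FLOOD", "Coastal Flood", "TSTM WIND")

# Upper-case first, then strip leading and trailing whitespace
ev_clean <- trimws(toupper(ev))

# Number of distinct labels with and without trimming
n_before <- length(unique(toupper(ev)))
n_after <- length(unique(ev_clean))
```

Here " COASTAL FLOOD" and "Coastal Flood" collapse into a single category only once the whitespace is removed.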

Let us capture only the events of interest to us. The data set records Fatalities, Injuries, Property Damage, and Crop Damage. Let us first calculate the fatalities by Event Type, using the upper-cased event types so that differently cased spellings are summed together.

agg_ev <- aggregate(act$FATALITIES, by=list((act$EVTYPE))
                    , sum);
str(agg_ev);
## 'data.frame':    898 obs. of  2 variables:
##  $ Group.1: chr  "   HIGH SURF ADVISORY" " COASTAL FLOOD" " FLASH FLOOD" " LIGHTNING" ...
##  $ x      : num  0 0 0 0 0 0 0 0 0 0 ...
colnames(agg_ev) <- c("Event_Type", "Fatalities");

Let us now keep only the event types with at least one fatality and sort them in decreasing order of fatalities

fatal <- agg_ev[agg_ev$Fatalities>0,];
harmful <- fatal[order(-fatal$Fatalities),];

Now let us report the most harmful event and then keep the top 10 events

cat("The most harmful event in the US is: ", harmful[1,1], " with "
    , harmful[1,2], " fatalities");
## The most harmful event in the US is:  TORNADO  with  5633  fatalities
#For our list let us load top 10 fatal events
top10 <- harmful[1:10,];
#GGPlot needs the x-axis as a factor to display in the right order that we need
top10$name <- factor(top10$Event_Type
                     , levels = top10$Event_Type[order(-top10$Fatalities)] );

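Since we need to compute the impact amount, we will define a conversion function that turns the exponent codes (H, K, M, B) into multipliers for the property and crop damages. Any other code is treated as a multiplier of 1.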

mult <- function(x) { if (toupper(x) == "H") {return(100);}
  else if (toupper(x) == "K") { return(1000)}
  else if (toupper(x) == "M") { return(1000000)}
  else if (toupper(x) == "B") {return(1000000000)}
  else return(1);
}
#Now create a vector with the multiplier
propMultiplier <- sapply(act$PROPDMGEXP, function(x) mult(x));
cropMultiplier <- sapply(act$CROPDMGEXP, function(x) mult(x));
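
Calling mult once per row through sapply is slow on roughly 900,000 rows. A minimal sketch of a vectorized alternative that uses a named lookup table with the same rules (H, K, M, B, everything else 1). The names `exp_table` and `mult_vec` are hypothetical and introduced only for this illustration, and the codes in the demonstration are invented.

```r
# Lookup table mirroring mult(): H, K, M, B; any other code counts as 1
exp_table <- c(H = 1e2, K = 1e3, M = 1e6, B = 1e9)

# Hypothetical vectorized counterpart of mult()
mult_vec <- function(x) {
  # Indexing by name gives NA for codes that are not in the table
  m <- exp_table[toupper(as.character(x))]
  m[is.na(m)] <- 1
  unname(m)
}

# Invented codes covering upper case, lower case, empty, and unknown values
codes <- factor(c("K", "m", "", "B", "?", "h", "5"))
multipliers <- mult_vec(codes)
```

With this table, the first row of the data (PROPDMG of 25.0 with PROPDMGEXP of K) corresponds to 25 * 1000 = 25000 dollars.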

Compute the actual values based on the multipliers above:

act$TOTPROPDMG <- act$PROPDMG * propMultiplier;
act$TOTCROPDMG <- act$CROPDMG * cropMultiplier;

Compute the total damage by summing these up:

act$TOTDMG <- act$TOTPROPDMG + act$TOTCROPDMG;

Compute the Human impact for Fatalities and Injuries

humanImpact <- summarize(group_by(act, EVTYPE), totalDeath = sum(FATALITIES, na.rm = TRUE)
                   , totalInjury = sum(INJURIES, na.rm = TRUE));
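
To make the grouping step concrete, here is a minimal base R sketch on an invented toy data frame. It is an illustration only: the report itself uses dplyr's group_by and summarize, and the values below are made up. It also shows what na.rm = TRUE does, namely that a missing count contributes nothing to the sum.

```r
# Toy records, invented for illustration
toy <- data.frame(EVTYPE = c("TORNADO", "FLOOD", "TORNADO", "HEAT"),
                  FATALITIES = c(1, 0, 2, NA),
                  INJURIES = c(10, 3, 5, 1),
                  stringsAsFactors = FALSE)

# Sum each measure within each event type, ignoring missing values
deaths_by_type <- tapply(toy$FATALITIES, toy$EVTYPE, sum, na.rm = TRUE)
injuries_by_type <- tapply(toy$INJURIES, toy$EVTYPE, sum, na.rm = TRUE)

# One row per event type, with the same column names as humanImpact
summary_toy <- data.frame(EVTYPE = names(deaths_by_type),
                          totalDeath = as.vector(deaths_by_type),
                          totalInjury = as.vector(injuries_by_type),
                          stringsAsFactors = FALSE)
```

The two TORNADO rows collapse into one row with 3 deaths and 15 injuries, and the HEAT row with a missing fatality count ends up with 0 deaths.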

Capture only non-zero Fatal or Injury data

nonZeroHI <- humanImpact[humanImpact$totalDeath>0 | humanImpact$totalInjury > 0 , ];
dim(nonZeroHI);
## [1] 205   3

Capture the deaths and injuries by event type

deaths <- nonZeroHI[order(-nonZeroHI$totalDeath), ];
deaths;
## Source: local data frame [205 x 3]
## 
##            EVTYPE totalDeath totalInjury
##             (chr)      (dbl)       (dbl)
## 1         TORNADO       5633       91346
## 2  EXCESSIVE HEAT       1903        6525
## 3     FLASH FLOOD        978        1777
## 4            HEAT        937        2100
## 5       LIGHTNING        816        5230
## 6       TSTM WIND        504        6957
## 7           FLOOD        470        6789
## 8     RIP CURRENT        368         232
## 9       HIGH WIND        248        1137
## 10      AVALANCHE        224         170
## ..            ...        ...         ...
injuries <- nonZeroHI[order(-nonZeroHI$totalInjury), ];
injuries;
## Source: local data frame [205 x 3]
## 
##               EVTYPE totalDeath totalInjury
##                (chr)      (dbl)       (dbl)
## 1            TORNADO       5633       91346
## 2          TSTM WIND        504        6957
## 3              FLOOD        470        6789
## 4     EXCESSIVE HEAT       1903        6525
## 5          LIGHTNING        816        5230
## 6               HEAT        937        2100
## 7          ICE STORM         89        1975
## 8        FLASH FLOOD        978        1777
## 9  THUNDERSTORM WIND        133        1488
## 10              HAIL         15        1361
## ..               ...        ...         ...

Convert the event type to a factor whose levels are in decreasing order of deaths, so that the plot shows the bars in that order

deaths$EVTYPE <- factor(deaths$EVTYPE, levels = deaths$EVTYPE[order(-deaths$totalDeath)]);
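
The reason for this step is that a plot places a factor on the axis in level order, and the default level order is alphabetical. A minimal sketch on invented totals, comparing the default levels with levels ordered by count:

```r
# Toy totals, invented for illustration
toy <- data.frame(EVTYPE = c("HEAT", "TORNADO", "FLOOD"),
                  totalDeath = c(30, 100, 10),
                  stringsAsFactors = FALSE)

# Default factor levels are sorted alphabetically
alphabetical <- levels(factor(toy$EVTYPE))

# Levels ordered by decreasing totalDeath, as in the step above
by_count <- levels(factor(toy$EVTYPE,
                          levels = toy$EVTYPE[order(-toy$totalDeath)]))
```

With the second ordering, TORNADO comes first on the axis, followed by HEAT and FLOOD.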

Results

  1. Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?

The event categories with the five highest death counts were TORNADO, EXCESSIVE HEAT, FLASH FLOOD, HEAT, and LIGHTNING. The event categories with the five highest injury counts were TORNADO, TSTM WIND, FLOOD, EXCESSIVE HEAT, and LIGHTNING. Let us plot the top 20 event types by fatalities.

df <- as.data.frame(deaths);
n <- 20;
c <- ggplot(df[1:n,], aes(x=df[1:n, "EVTYPE"], 
                          y=df[1:n, "totalDeath"] 
                          #, fill = df[1:n, "EVTYPE"]
                          )) ;
c <- c + ggtitle("Top 20 Fatalities by Event Type in US");
c <- c + labs(x = "Event Type", y = "Fatalities");
c <- c + theme(plot.background=element_rect(fill="lightblue"));
c <- c +  geom_text(aes(label= df[1:n, "totalDeath"]), size = 3
                    , vjust = -1
                    , position = "stack");
c <- c + geom_bar(stat = "identity");
#Axis labels need to be vertical
c <- c + theme(axis.text.x = element_text(angle = 90, hjust = 1))
print(c);

Let us now plot the injuries.

injuries$EVTYPE <- factor(injuries$EVTYPE, levels = injuries$EVTYPE[order(-injuries$totalInjury)]);

df <- as.data.frame(injuries);
c <- ggplot(df[1:n,], aes(x=df[1:n, "EVTYPE"], 
                          y=df[1:n, "totalInjury"] 
                          #, fill = df[1:n, "EVTYPE"]
)) ;
c <- c + ggtitle("Top 20 Injuries by Event Type in US");
c <- c + labs(x = "Event Type", y = "Injuries");
c <- c + theme(plot.background=element_rect(fill="grey"));
c <- c +  geom_text(aes(label= df[1:n, "totalInjury"]), size = 3
                    , vjust = -1
                    , position = "stack");
c <- c + geom_bar(stat = "identity");
#Axis labels need to be vertical
c <- c + theme(axis.text.x = element_text(angle = 90, hjust = 1))
print(c);

Compute the dollar impact of property and crop damage by event type, in millions of dollars

dollarImpact <- summarize(group_by(act, EVTYPE)
                          , totalCost = sum(TOTDMG/1000000, na.rm = TRUE));
dollarImpact$EVTYPE <- factor(dollarImpact$EVTYPE, levels = dollarImpact$EVTYPE[order(-dollarImpact$totalCost)]);
dollarImpact <- dollarImpact[order(-dollarImpact$totalCost), ];
#Capture only non-zero Cost data
nonZeroDI <- dollarImpact[dollarImpact$totalCost>0, ];
dim(nonZeroDI);
## [1] 397   2

Now we will plot the Top 20 costly events

  2. Across the United States, which types of events have the greatest economic consequences?

The event categories with the five highest economic impacts were FLOOD, HURRICANE (TYPHOON), TORNADO, STORM SURGE, and HAIL. Let us plot the cost details we computed above.

df <- as.data.frame(nonZeroDI);
c <- ggplot(df[1:n,], aes(x=df[1:n, "EVTYPE"], 
                          y=df[1:n, "totalCost"] 
                          #, fill = df[1:n, "EVTYPE"]
)) ;
c <- c + ggtitle("Top 20 Costly Events by Event Type in US");
c <- c + labs(x = "Event Type", y = "Cost in millions of $");
c <- c + theme(plot.background=element_rect(fill="lightgreen"));
c <- c +  geom_text(aes(label= df[1:n, "totalCost"]), size = 3
                    , vjust = -1
                    , position = "stack");
c <- c + geom_bar(stat = "identity");
#Axis labels need to be vertical
c <- c + theme(axis.text.x = element_text(angle = 90, hjust = 1))
print(c);