Reproducible Research: Peer Assessment 2

Title

Exploring the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database and figuring out the economic / social impact of various events.

Synopsis

This analysis is done on the National Weather Service Storm dataset which has been collected starting in the year 1950 till November 2011. This database tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage. The following specific questions have been explored in this ananlysis.

Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?
Across the United States, which types of events have the greatest economic consequences?

For this analysis, the complete dataset has been read in a dataframe called “data”. Then the “EVTYPE variable has been cleaned based on the given event types. After that, the events which caused maximum fatalities / injuries have been measured and plotted.

Also, the economic effect of these events have been indentified and plotted. This is done based on the “PRODDMG” and “CROPDMG” variables.

Data Processing

The storm database can be downloaded from the following location: Storm Data [47Mb]

The file is in the form of a comma-separated-value file compressed via the bzip2 algorithm to reduce its size. For the analysis, it has already been downloaded and save locally. Read the data in a dataframe and have a look at the content.

setwd("/Users/promisinganuj/Data/Technical/R Language")
data=read.csv(bzfile("repdata-data-StormData.csv.bz2","rt"),header=TRUE)
# load(".RData")
dim(data)

## [1] 902297     37

names(data)

##  [1] "STATE__"    "BGN_DATE"   "BGN_TIME"   "TIME_ZONE"  "COUNTY"    
##  [6] "COUNTYNAME" "STATE"      "EVTYPE"     "BGN_RANGE"  "BGN_AZI"   
## [11] "BGN_LOCATI" "END_DATE"   "END_TIME"   "COUNTY_END" "COUNTYENDN"
## [16] "END_RANGE"  "END_AZI"    "END_LOCATI" "LENGTH"     "WIDTH"     
## [21] "F"          "MAG"        "FATALITIES" "INJURIES"   "PROPDMG"   
## [26] "PROPDMGEXP" "CROPDMG"    "CROPDMGEXP" "WFO"        "STATEOFFIC"
## [31] "ZONENAMES"  "LATITUDE"   "LONGITUDE"  "LATITUDE_E" "LONGITUDE_"
## [36] "REMARKS"    "REFNUM"

There are 37 columns in the dataset as shown above. For this analysis, the following fields would be required:

EVTYPE: Environment Type. 48 distinct types.
FALATITIES: Total fatalities in number because of the corresponding event
INJURIES: Total human injuries in number because of the corresponding event
PROPDMG: Property Damage. Rounded to 3 significant digits
PROPDMGEXP: Alphabetical characters used to signify magnitude include “K” for thousands, “M” for millions, and “B” for billions.
CROPDMG: Crop Damage. Rounded to 3 significant digits.
CROPDMGEXP: Alphabetical characters used to signify magnitude include “K” for thousands, “M” for millions, and “B” for billions.

For more details about the variable, please refer to the copybook at: Storm Data Documentation

Now check how many unique event types are there in the dataset:

dim(as.data.frame(table(data$EVTYPE)))

## [1] 985   2

By sneaking at the event type values, it is evident that there are redundencies in the event names such as “FLOOD”, “Flood”, “Floods” etc. Lets try to cleanup the variable names.

# Changing values to uppercase
data <- transform(data, EVTYPE = toupper(EVTYPE))
# AVALANCHE
data[data$EVTYPE == "AVALANCE", ]$EVTYPE <- "AVALANCHE"
# COASTAL FLOOD
cFlood <- c("COASTAL FLOODING", "COASTAL FLOODING/EROSION", "COASTAL STORM", "COASTAL SURGE", "COASTAL/TIDAL FLOOD", "COASTALSTORM", "COASTALSTORM", "COASTALFLOOD")
data[data$EVTYPE %in% cFlood, ]$EVTYPE <- "COASTAL FLOOD"
# COLD/WIND CHILL
cCold <- c("COLD","COLD WIND CHILL TEMPERATURES","COLD WEATHER","COLD TEMPERATURES","COLD TEMPERATURE","COLD AND FROST", "COLD AIR FUNNEL")
data[data$EVTYPE %in% cCold, ]$EVTYPE <- "COLD/WIND CHILL"
# DROUGHT
data[data$EVTYPE == "DROUGHT/EXCESSIVE HEAT", ]$EVTYPE <- "DROUGHT"
# EXCESSIVE HEAT
data[data$EVTYPE == "EXTREME HEAT", ]$EVTYPE <- "EXCESSIVE HEAT"
# EXTREME COLD/WIND CHILL
cChill <- c("EXTREME COLD","EXTREME WIND CHILL","EXTREME WINDCHILL","EXTREME WINDCHILL TEMPERATURES","EXTREME/RECORD COLD")
data[data$EVTYPE %in% cChill, ]$EVTYPE <- "EXTREME COLD/WIND CHILL"
# FLASH FLOOD
cFlashFlood <- c("FLASH FLOOD - HEAVY RAIN","FLASH FLOOD FROM ICE JAMS","FLASH FLOOD/FLOOD","FLASH FLOODING","FLASH FLOODING/FLOOD","FLASH FLOODS","FLASH FLOOODING")
data[data$EVTYPE %in% cFlashFlood, ]$EVTYPE <- "FLASH FLOOD"
# THUNDERSTORM WIND
data[grep("^THUN.+ST[O,R,M]{3}.+WIND", data$EVTYPE, value=F),]$EVTYPE = "THUNDERSTORM WIND"
data[grep("TSTM WIND", data$EVTYPE, value=F),]$EVTYPE = "THUNDERSTORM WIND"

Now that the base data has been cleaned, the actual processing can now be done.

Result

1. The most harmful events to human health.

For answering this statement, there are two key variables, “FATALITIES” and “INJURIES” which have already been described above. Let’s try to aggreate and plot these values for different event types.

fatalities <- aggregate(FATALITIES ~ EVTYPE, data = data, FUN = sum)
fatalities.order <- fatalities[order(fatalities$FATALITIES, decreasing = TRUE), ]
head(fatalities.order)

##                EVTYPE FATALITIES
## 658           TORNADO       5633
## 98     EXCESSIVE HEAT       1999
## 114       FLASH FLOOD       1018
## 212              HEAT        937
## 387         LIGHTNING        816
## 648 THUNDERSTORM WIND        743

Based on above, it is evident that maximum fatalities are caused by “TORNADO” followed distinctly by “EXCESSIVE HEAD” and “FLASH FLOODS”. Now, checking the trend against “injuries”.

injuries <- aggregate(INJURIES ~ EVTYPE, data = data, FUN = sum)
injuries.order <- injuries[order(injuries$INJURIES, decreasing = TRUE), ]
head(injuries.order)

##                EVTYPE INJURIES
## 658           TORNADO    91346
## 648 THUNDERSTORM WIND     9478
## 123             FLOOD     6789
## 98     EXCESSIVE HEAT     6680
## 387         LIGHTNING     5230
## 212              HEAT     2100

Again, the maximum injuries are caused by “TORNADO”. Let’s plot the first 10 events causing maximum fatalities and injuries respectively.

par(mfrow = c(2, 1), cex = .75)
barplot(fatalities.order[1:10, 2], col = rainbow(10), ylab = "Fatalities Count", main = "10 natural events cause most fatalities")
legend("topright", legend = fatalities.order[1:10, 1], cex = 0.5, fill = rainbow(10))
barplot(injuries.order[1:10, 2], col = rainbow(10), ylab = "Injuries Count", main = "10 natural events cause most people injuries")
legend("topright", legend = injuries.order[1:10, 1], cex = 0.5, fill = rainbow(10))

The cumulative effect can also be shown as:

most.harmful <- aggregate(INJURIES + FATALITIES ~ EVTYPE, data = data, FUN = sum)
names(most.harmful)[2] <- "COUNT"
most.harmful.order <- most.harmful[order(most.harmful$COUNT, decreasing = TRUE), ]
head(most.harmful.order)

##                EVTYPE COUNT
## 658           TORNADO 96979
## 648 THUNDERSTORM WIND 10221
## 98     EXCESSIVE HEAT  8679
## 123             FLOOD  7259
## 387         LIGHTNING  6046
## 212              HEAT  3037

Plotting the cumulative effect as well:

barplot(most.harmful.order[1:10, 2], col = rainbow(10), ylab = "Total Count (Fatalities + Injuries)", main = "10 natural events cause most people injuries and/or fatalities")
legend("topright", legend = most.harmful.order[1:10, 1], cex = 0.5, fill = rainbow(10))

Based on the above observation, it is evident that “TORNADO” is the most harmful event individually and as a whole.

2. The most harmful events to ecomony.

For this analysis, the relevant columns are “EVTYPE”, PRODDMG“,”PROPDMGEXP“,”CROPDMG" and “CROPDMGEXP”. Lets have a look at the “PROPDMG” and “CROPDMG” variables.

unique(data$PROPDMGEXP)

##  [1] K M   B m + 0 5 6 ? 4 2 3 h 7 H - 1 8
## Levels:  - ? + 0 1 2 3 4 5 6 7 8 B h H K m M

as.data.frame(table(data$PROPDMGEXP))

##    Var1   Freq
## 1       465934
## 2     -      1
## 3     ?      8
## 4     +      5
## 5     0    216
## 6     1     25
## 7     2     13
## 8     3      4
## 9     4      4
## 10    5     28
## 11    6      4
## 12    7      5
## 13    8      1
## 14    B     40
## 15    h      1
## 16    H      6
## 17    K 424665
## 18    m      7
## 19    M  11330

unique(data$CROPDMGEXP)

## [1]   M K m B ? 0 k 2
## Levels:  ? 0 2 B k K m M

as.data.frame(table(data$CROPDMGEXP))

##   Var1   Freq
## 1      618413
## 2    ?      7
## 3    0     19
## 4    2      1
## 5    B      9
## 6    k     21
## 7    K 281832
## 8    m      1
## 9    M   1994

As mentioned in the “Data Processing” section, there are only three valid values for “%EXP” variables which are “K”, “M” and “B”. As it can be seen from above data, the values are in mixed cases, let’s change them to uppercase and calculate the actual damage values.

data <- transform(data, PROPDMGEXP = toupper(PROPDMGEXP))
data <- transform(data, CROPDMGEXP = toupper(CROPDMGEXP))
data[data$PROPDMGEXP == "K", ]$PROPDMG <- data[data$PROPDMGEXP == "K", ]$PROPDMG * 1e03
data[data$PROPDMGEXP == "M", ]$PROPDMG <- data[data$PROPDMGEXP == "M", ]$PROPDMG * 1e06
data[data$PROPDMGEXP == "B", ]$PROPDMG <- data[data$PROPDMGEXP == "B", ]$PROPDMG * 1e09
data[data$CROPDMGEXP == "K", ]$CROPDMG <- data[data$CROPDMGEXP == "K", ]$CROPDMG * 1e03
data[data$CROPDMGEXP == "M", ]$CROPDMG <- data[data$CROPDMGEXP == "M", ]$CROPDMG * 1e06
data[data$CROPDMGEXP == "B", ]$CROPDMG <- data[data$CROPDMGEXP == "B", ]$CROPDMG * 1e09

As we are interested in overall economic consequences, it makes sense to add the property and crop damage figures to see the overall trend. Let’s make a plot to summarise the 10 most economically harmful events.

total.damage <- aggregate(CROPDMG + PROPDMG ~ EVTYPE, data = data, sum)
names(total.damage)[2] <- "TOTALDMG"
total.damage.order <- total.damage[order(total.damage$TOTALDMG, decreasing = TRUE), ]
head(total.damage.order)

##                EVTYPE     TOTALDMG
## 123             FLOOD 150319678257
## 341 HURRICANE/TYPHOON  71913712800
## 658           TORNADO  57352114049
## 565       STORM SURGE  43323541000
## 181              HAIL  18758221521
## 114       FLASH FLOOD  18168929327

barplot(total.damage.order[1:10, 2], col = rainbow(10), ylab = "Total Damage", main = "10 most economically harmful natural events")
legend("topright", legend = total.damage.order[1:10, 1], fill = rainbow(10), cex = 0.5)

Based on the above plot, it is evident that “flooding” has most severe economic consequences followed by “hurricanes” and “tornados”.