Synopsis

Extreme weather events can be harmful and costly for communities through causing fatalities, injuries, property and crop damage. This analysis uses the U.S. National Oceanic and Atmospheric Administration’s (NOAA) Storm Database, characterizes major storms and weather events in the Unitied States from 1950 to 2011, to explore which types of events (1) are most harmful with respect to population health and (2) have the greatest economic consequences.

After classifying the event names to their respective groups, and analysing the top 14 groups, Flood is the most damaging of all the events in terms of total property damage and total crop damage. Heat Related events have the most devastating effects to the crop.

After classifying the event names to their respective groups, and analysing the top 14 groups, Tornado is the most damaging of all the events in terms of total fatalities and total injuries.

Data Processing

1) Raw Data

Data Accessed: September 19, 2014

The data for this assignment come in the form of a comma-separated-value file compressed via the bzip2 algorithm to reduce its size. You can download the file from the course web site:

Storm Data [47Mb] There is also some documentation of the database available. Here you will find how some of the variables are constructed/defined.

National Weather Service Storm Data Documentation

National Climatic Data Center Storm Events FAQ

The events in the database start in the year 1950 and end in November 2011. In the earlier years of the database there are generally fewer events recorded, most likely due to a lack of good records. More recent years should be considered more complete.

2) Downloading Data

The following code checks if the data file is already existing on the ./data folder, if not then it downloads the data file and save it as repdata-data-StormData.csv.bz2 in the ./data directory.

fileNameZip <- "repdata-data-StormData.csv.bz2"
fileDestZip <- paste(".","data", fileNameZip, sep = "/")
fileURL <- "http://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2" ## Be sure to use http, instead of https in the file URL since it gives errors in .Rmd files.

if (file.exists(fileNameZip)) {
        message("Data file already exists in the current working directory")
} else {
        message("Download the zipped data file into './data' directory")
        if (!file.exists("./data")) {
                dir.create("./data")
        }
        download.file(fileURL, fileDestZip, method = "auto")
}
## Data file already exists in the current working directory

3) Reading Data

Since the file compressed via the bzip2 algorithm, the size of unzipped data file is large (548Mb). Thus, it is better to read the data directly from the zipped file.

read.csv is used to read the data, since it is in a csv format with first row containing the column names so no additional parameters needed to be added.

Observing the dimension of the data, and the column names so that the required columns can be selected for the analysis.

dataRaw<- read.csv("./repdata-data-StormData.csv.bz2")
dim(dataRaw)
## [1] 902297     37
colnames(dataRaw)
##  [1] "STATE__"    "BGN_DATE"   "BGN_TIME"   "TIME_ZONE"  "COUNTY"    
##  [6] "COUNTYNAME" "STATE"      "EVTYPE"     "BGN_RANGE"  "BGN_AZI"   
## [11] "BGN_LOCATI" "END_DATE"   "END_TIME"   "COUNTY_END" "COUNTYENDN"
## [16] "END_RANGE"  "END_AZI"    "END_LOCATI" "LENGTH"     "WIDTH"     
## [21] "F"          "MAG"        "FATALITIES" "INJURIES"   "PROPDMG"   
## [26] "PROPDMGEXP" "CROPDMG"    "CROPDMGEXP" "WFO"        "STATEOFFIC"
## [31] "ZONENAMES"  "LATITUDE"   "LONGITUDE"  "LATITUDE_E" "LONGITUDE_"
## [36] "REMARKS"    "REFNUM"

4) Subsetting and Cleaning the raw data

Before applying the transformations, we include the libraries which may be needed for subsetting and transformations.

library(plyr)
library(reshape2)
library(xtable)
library(ggplot2)

Since the analysis concerns majorily with a few set of columns so subsetting the whole data with the selected columns i.e. "EVTYPE", "FATALITIES", "INJURIES", "PROPDMG", "PROPDMGEXP", "CROPDMG", "CROPDMGEXP"

requiredCOl <- c("EVTYPE", "FATALITIES", "INJURIES", "PROPDMG", "PROPDMGEXP", "CROPDMG", "CROPDMGEXP")
data<-dataRaw[,requiredCOl]

Looking for the data if it consist of any NA values

colSums(is.na(data))
##     EVTYPE FATALITIES   INJURIES    PROPDMG PROPDMGEXP    CROPDMG 
##          0          0          0          0          0          0 
## CROPDMGEXP 
##          0

Since data does not contains any NA values so we can proceed with the further analysis

Some of the columns, which need to be transformed include * PROPDMG: Property damange amount. * CROPDMG: Crop damange amount. * PROPDMGEXP: Unit of property damage. Alphabetical characters used to signify magnitude include “K” for thousands, “M” for millions, “B” for billions and “” for non. If additional precision is available, it may be provided in the narrative part of the entry. * CROPDMGEXP: Unit of crop damange. Same logical alphabetical units used in PROPDMGEXP applies. + According to the original data documentation there ar 4 units for PROPDMG and CROPDMG; however, computations clearly shows that there are some other alphabetical units such as +, -, ?, h and numbers (instead of letter). So, this analysis interprests signs as non units, and h as hundreds.

Mapping the character symbols to their numerical counterparts for PROPDMGEXP variable

## visualizing the characters in the PROPDMGEXP variable
table(data$PROPDMGEXP)
## 
##             -      ?      +      0      1      2      3      4      5 
## 465934      1      8      5    216     25     13      4      4     28 
##      6      7      8      B      h      H      K      m      M 
##      4      5      1     40      1      6 424665      7  11330
## mapping each character to its numeric counterpart and replacing them by `chartr`
## function
data$PROPDMGEXP<-chartr("-?+hHkKmMbB", "11122336699", data$PROPDMGEXP)

## Some rows have empty space as the value so that is assumed to be 0
data$PROPDMGEXP[data$PROPDMGEXP==""]<-0

Similar mapping is done for ``

## visualizing the characters in the CROPDMGEXP variable
table(data$CROPDMGEXP)
## 
##             ?      0      2      B      k      K      m      M 
## 618413      7     19      1      9     21 281832      1   1994
## mapping each character to its numeric counterpart and replacing them by `chartr`
## function
data$CROPDMGEXP<-chartr("?kKmMbB", "1336699", data$CROPDMGEXP)

## Some rows have empty space as the value so that is assumed to be 0
data$CROPDMGEXP[data$CROPDMGEXP==""]<-0

Create variable PropDamage and CropDamage which contains the values after multiplying \(variableDMG * 10^variableDMGEXP\). PropDamage and CropDamage values are calculated in millions.

data$PropDamage<-data$PROPDMG*(10^(as.numeric(data$PROPDMGEXP)-6))
data$CropDamage<-data$CROPDMG*(10^(as.numeric(data$CROPDMGEXP)-6))

On visualizing the column EVTYPE in the data, it is noted it contains 985 different unique levels, so all levels were analysed.

length(levels(data$EVTYPE))
## [1] 985

So a set of keywords were selected and were mapped to a Event group that describe them the best

The mapping of the keywords were done as follows

keywords1 <- c("(.*FLOOD.*|.*fld.*)", "(.*TORN.*|.*WATERSPOUT.*)", "(.*HURRICANE.*|.*TYPHOON.*)", "(.*THUNDE.*|.*STORM.*|.*TROPICAL DEPRESSION.*)", ".*FIRE.*", "(.*ICY.*|.*ICE.*|.*FROST.*|.*SNOW.*|.*HAIL.*|.*FREEZ.*|.*BLIZ.*|.*WINT.*|.*GLAZE.*|.*COLD.*)", "(.*WIND.*|.*DUST*.)", "(.*RAIN.*|.*PRECIPITATION.*|.*WET.*|.*HEAVY MIX.*|.*PRECIP.*)", "(.*HEAT.*|.*DRY.*|.*DROUGHT.*|.*WARM.*)",  ".*FOG.*", "(.*TIDE.*|.*SURF.*|.*SEICHE.*|.*TSUNAMI.*|.*COASTAL SURGE.*|.*HIGH WATER.*|.*CURRENT.*|.*SEAS.*|.*MARINE.*)", "(.*AVALANCHE.*|.*EROSION.*|.*SLIDE.*|.*Landslump.*)","(.*LIGHTNING.*)", "(.*DAM BREAK.*|.*VOLCANIC ASH.*|.*OTHER.*)")

replace1 <- c("FLOOD", "TORNADO", "HURRICANE", "STORM", "FIRE", "WINTER RELATED", "WIND", "RAIN", "HEAT RELATED", "FOG", "OCEAN RELATED", "LAND RELATED","LIGHTNING RELATED", "OTHER")

table1 <- xtable(cbind(keywords1, replace1))
print(table1, type = "html")
keywords1 replace1
1 (.FLOOD.|.fld.) FLOOD
2 (.TORN.|.WATERSPOUT.) TORNADO
3 (.HURRICANE.|.TYPHOON.) HURRICANE
4 (.THUNDE.|.STORM.|.TROPICAL DEPRESSION.) STORM
5 .FIRE. FIRE
6 (.ICY.|.ICE.|.FROST.|.SNOW.|.HAIL.|.FREEZ.|.BLIZ.|.WINT.|.GLAZE.|.COLD.) WINTER RELATED
7 (.WIND.|.DUST.) WIND
8 (.RAIN.|.PRECIPITATION.|.WET.|.HEAVY MIX.|.PRECIP.) RAIN
9 (.HEAT.|.DRY.|.DROUGHT.|.WARM.) HEAT RELATED
10 .FOG. FOG
11 (.TIDE.|.SURF.|.SEICHE.|.TSUNAMI.|.COASTAL SURGE.|.HIGH WATER.|.CURRENT.|.SEAS.|.MARINE.) OCEAN RELATED
12 (.AVALANCHE.|.EROSION.|.SLIDE.|.Landslump.) LAND RELATED
13 (.LIGHTNING.) LIGHTNING RELATED
14 (.DAM BREAK.|.VOLCANIC ASH.|.OTHER.) OTHER

This mapping was run on the EVTYPE column to reduce the levels of the event, which was reduced to 162 seperate events.

eventType<-as.character(data$EVTYPE)
for (i in seq_along(keywords1)){
         found<-grepl(keywords1[i], eventType, ignore.case = TRUE)
         eventType[found]<-replace1[i]
}
data$EVTYPE<-as.factor(eventType)
length(levels(data$EVTYPE))
## [1] 162

Now these rows are summarized by based on their Event type by summing their PropDamage, CropDamage, FATALITIES and INJURIES as totalPropDamage, totalCropDamage, totalFatalities and totalInjuries. Two new columns are added totalDamage (as sum of totalPropDamage and totalCropDamage) and healthHarm (as sum of totalFatalities and totalInjuries)

finalCol<-c("EVTYPE")
summarizedRes<-ddply(data, finalCol, summarize, totalPropDamage=sum(PropDamage), totalCropDamage=sum(CropDamage), totalFatalities=sum(FATALITIES), totalInjuries=sum(INJURIES))
summarizedRes$totalDamage <- summarizedRes$totalPropDamage + summarizedRes$totalCropDamage
summarizedRes$healthHarm<-summarizedRes$totalFatalities + summarizedRes$totalInjuries

hist(log(summarizedRes$totalDamage), breaks = 10, main = "Histogram for Distribution of Total Economic Damage", xlab = expression("log(Total Economic Damage(in million Dollar scale))"), col="lightblue4")

plot of chunk summarizeByEvtType

hist(log(summarizedRes$healthHarm), breaks = 10, , main = "Histogram for Distribution of Total Health Damage", xlab = expression("log(Total Health Damage)"), col="lightblue4")

plot of chunk summarizeByEvtType

Visualizing from the bimodal structure of the graph for total damage and health harm the event type cause, the event type can be divided to two classifications. ### 5) Analysis #### 5.1) Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?

First order the summarized data on descending order based on the healthHarm attribute. It is noticed that maximum harm is visible in the first 14 type of event types.

So melt the data based on the totalFatalities and totalInjuries and then plot EVTYpe with the healthHarm, filling in according to the melted column.

harm<-summarizedRes[order(summarizedRes$healthHarm, decreasing = TRUE),]
head(harm[,c("EVTYPE","totalFatalities","totalInjuries", "healthHarm")], 14)
##                EVTYPE totalFatalities totalInjuries healthHarm
## 145           TORNADO            5664         91436      97100
## 24       HEAT RELATED            3181          9275      12456
## 14              FLOOD            1553          8683      10236
## 160              WIND             983          8909       9892
## 77              STORM             635          6691       7326
## 42  LIGHTNING RELATED             817          5231       6048
## 161    WINTER RELATED             819          4821       5640
## 53      OCEAN RELATED             795           924       1719
## 12               FIRE              90          1608       1698
## 36          HURRICANE             135          1333       1468
## 15                FOG              80          1076       1156
## 38       LAND RELATED             268           225        493
## 55               RAIN             102           306        408
## 54              OTHER               8             4         12
harmFinal<- melt(harm[1:14,], id=c("EVTYPE", "healthHarm"), measure.vars=c("totalFatalities", "totalInjuries"))
ggplot(harmFinal, aes(x = reorder(EVTYPE, healthHarm), y = value, fill=variable)) +
  geom_bar(stat = "identity") +
  coord_flip() +
  labs(title = "Harm on Public Health by Generally Equivalent Event Types", y = "Total Harm Puplic Health", x = "Generally Equivalent Events Types") +
  scale_fill_manual(values = c("lightblue4", "lightblue3"), name = "Harm Type", breaks = c("totalFatalities", "totalInjuries"), labels = c("Fatalities", "Injuries"))

plot of chunk popHealthAnalysis

As observed, Tornado has the biggest impact on public health with the total harm of 97100, with 5664 fatalities and 91436 injuries

5.2) Across the United States, which types of events have the greatest economic consequences?

First order the summarized data on descending order based on the healthHarm attribute. It is noticed that maximum harm is visible in the first 14 type of event types.

So melt the data based on the totalPropDamage and totalPropDamage and then plot EVTYpe with the totalDamage, filling in according to the melted column.

Damage<-summarizedRes[order(summarizedRes$totalDamage, decreasing = TRUE),]
head(harm[,c("EVTYPE","totalPropDamage","totalCropDamage", "totalDamage")], 14)
##                EVTYPE totalPropDamage totalCropDamage totalDamage
## 145           TORNADO       5.861e+04         417.462   5.903e+04
## 24       HEAT RELATED       1.073e+03       14877.060   1.595e+04
## 14              FLOOD       1.683e+05       12388.597   1.807e+05
## 160              WIND       1.058e+04        1321.333   1.190e+04
## 77              STORM       7.327e+04        6406.889   7.967e+04
## 42  LIGHTNING RELATED       9.304e+02          12.092   9.425e+02
## 161    WINTER RELATED       1.779e+04        6787.134   2.457e+04
## 53      OCEAN RELATED       2.575e+02           0.020   2.575e+02
## 12               FIRE       8.502e+03         403.282   8.905e+03
## 36          HURRICANE       8.536e+04        5516.118   9.087e+04
## 15                FOG       2.283e+01           0.000   2.283e+01
## 38       LAND RELATED       3.320e+02          20.017   3.520e+02
## 55               RAIN       3.233e+03         953.153   4.186e+03
## 54              OTHER       1.558e+00           1.034   2.592e+00
DamageFinal<- melt(Damage[1:14,], id=c("EVTYPE", "totalDamage"), measure.vars=c("totalPropDamage", "totalCropDamage"))
ggplot(DamageFinal, aes(x = reorder(EVTYPE, totalDamage), y = value, fill=variable)) +
  geom_bar(stat = "identity") +
  coord_flip() +
  ylab("Total number in Million dollars") +
  xlab("Generally Equivalent Events Types") +
  ggtitle("Total number of damages per weather event") +
  scale_fill_manual(values = c("lightblue4", "lightblue3"), name = "Damage Type", breaks = c("totalPropDamage", "totalCropDamage"), labels = c("Property Damage", "Crop Damage"))

plot of chunk popDamageAnalysis

As observed, Flood are the cause of maximum economic damages of 180700 Million dollar