Synopsis

Introduction

Storms and other severe weather events can cause both public health and economic problems for communities and municipalities. Many severe events can result in fatalities, injuries, and property damage, and preventing such outcomes to the extent possible is a key concern.

This project involves exploring the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. This database tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage.

Analysis Goal

The basic goal of this exercise is to explore the NOAA Storm Database and answer some basic questions about severe weather events. We would primarily address following two questions through some basic analysis.

Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?
Across the United States, which types of events have the greatest economic consequences?

Data

The data for this assignment come in the form of a comma-separated-value file compressed via the bzip2 algorithm to reduce its size. You can download the file from the course web site:

Storm Data [47Mb] There is also some documentation of the database available. Here you will find how some of the variables are constructed/defined.

National Weather Service Storm Data Documentation

National Climatic Data Center Storm Events FAQ

The events in the database start in the year 1950 and end in November 2011. In the earlier years of the database there are generally fewer events recorded, most likely due to a lack of good records. More recent years should be considered more complete.

Data Processing

library(data.table)
library(xtable)
library(ggplot2)
library(lattice)
library(plyr)
library(knitr)
library(lubridate)


Attaching package: 'lubridate'

The following object is masked from 'package:plyr':

    here

The following objects are masked from 'package:data.table':

    hour, mday, month, quarter, wday, week, yday, year

options(scipen=999)

Downloading Data

Download the storm data from web using the script download.file("https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2","StormData.csv.bz2") or manually and save the file as “StormData.csv.bz2” in your present working directory of current R-session. Here, we would download the file using script only.

library(RCurl)
file<-"StormData.csv.bz2"
url<-"http://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
download.file(url, file, mode = "wb")

Loading Data

In R reading in a .bz2 file can be done using read.table() command as we have done here to read “StormData.csv.bz2”. Sometime we may have to ise bzfile("file.bz2", "rt") to uncompress file and then read.

stormData<-read.table(file, sep=",", header=TRUE)

We have used echo=TRUE to print the code chunk in the report and cache=TRUE with the r-chunk to reduce re-production time of large data processing when the processing being done already, no need to run the code again.

Basic Analysis

It is always good to know some basics about the contents of the data using dim(),head(), tail(), summary() or str() like functions before proceeding further.

str(stormData)

'data.frame':   902297 obs. of  37 variables:
 $ STATE__   : num  1 1 1 1 1 1 1 1 1 1 ...
 $ BGN_DATE  : Factor w/ 16335 levels "1/1/1966 0:00:00",..: 6523 6523 4242 11116 2224 2224 2260 383 3980 3980 ...
 $ BGN_TIME  : Factor w/ 3608 levels "00:00:00 AM",..: 272 287 2705 1683 2584 3186 242 1683 3186 3186 ...
 $ TIME_ZONE : Factor w/ 22 levels "ADT","AKS","AST",..: 7 7 7 7 7 7 7 7 7 7 ...
 $ COUNTY    : num  97 3 57 89 43 77 9 123 125 57 ...
 $ COUNTYNAME: Factor w/ 29601 levels "","5NM E OF MACKINAC BRIDGE TO PRESQUE ISLE LT MI",..: 13513 1873 4598 10592 4372 10094 1973 23873 24418 4598 ...
 $ STATE     : Factor w/ 72 levels "AK","AL","AM",..: 2 2 2 2 2 2 2 2 2 2 ...
 $ EVTYPE    : Factor w/ 985 levels "   HIGH SURF ADVISORY",..: 834 834 834 834 834 834 834 834 834 834 ...
 $ BGN_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
 $ BGN_AZI   : Factor w/ 35 levels "","  N"," NW",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ BGN_LOCATI: Factor w/ 54429 levels "","- 1 N Albion",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ END_DATE  : Factor w/ 6663 levels "","1/1/1993 0:00:00",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ END_TIME  : Factor w/ 3647 levels ""," 0900CST",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ COUNTY_END: num  0 0 0 0 0 0 0 0 0 0 ...
 $ COUNTYENDN: logi  NA NA NA NA NA NA ...
 $ END_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
 $ END_AZI   : Factor w/ 24 levels "","E","ENE","ESE",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ END_LOCATI: Factor w/ 34506 levels "","- .5 NNW",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ LENGTH    : num  14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
 $ WIDTH     : num  100 150 123 100 150 177 33 33 100 100 ...
 $ F         : int  3 2 2 2 2 2 2 1 3 3 ...
 $ MAG       : num  0 0 0 0 0 0 0 0 0 0 ...
 $ FATALITIES: num  0 0 0 0 0 0 0 0 1 0 ...
 $ INJURIES  : num  15 0 2 2 2 6 1 0 14 0 ...
 $ PROPDMG   : num  25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
 $ PROPDMGEXP: Factor w/ 19 levels "","-","?","+",..: 17 17 17 17 17 17 17 17 17 17 ...
 $ CROPDMG   : num  0 0 0 0 0 0 0 0 0 0 ...
 $ CROPDMGEXP: Factor w/ 9 levels "","?","0","2",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ WFO       : Factor w/ 542 levels ""," CI","$AC",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ STATEOFFIC: Factor w/ 250 levels "","ALABAMA, Central",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ ZONENAMES : Factor w/ 25112 levels "","                                                                                                                               "| __truncated__,..: 1 1 1 1 1 1 1 1 1 1 ...
 $ LATITUDE  : num  3040 3042 3340 3458 3412 ...
 $ LONGITUDE : num  8812 8755 8742 8626 8642 ...
 $ LATITUDE_E: num  3051 0 0 0 0 ...
 $ LONGITUDE_: num  8806 0 0 0 0 ...
 $ REMARKS   : Factor w/ 436781 levels "","-2 at Deer Park\n",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ REFNUM    : num  1 2 3 4 5 6 7 8 9 10 ...

There are 902,297 observations and 37 columns in the Storm Data. str() also produces basic summaries of each columns (variables) in the data along with type of variables like “numeric”, integer“,”factor“,”character“,”date" etc. so that we can make right transformation of variables when needed.

Data Transformation

For this analysis we would require only fewer variables/columns from “stormData” to analyze severe weather events viz. EVTYPE - Type of events, FATALITIES - Fatalities, BGN_DATE - Begining date, PROPDMG-Property Damage. We will subset the data for faster processing.

subStormData<-stormData[,c("EVTYPE", "FATALITIES", "BGN_DATE", "PROPDMG")]
str(subStormData)

'data.frame':   902297 obs. of  4 variables:
 $ EVTYPE    : Factor w/ 985 levels "   HIGH SURF ADVISORY",..: 834 834 834 834 834 834 834 834 834 834 ...
 $ FATALITIES: num  0 0 0 0 0 0 0 0 1 0 ...
 $ BGN_DATE  : Factor w/ 16335 levels "1/1/1966 0:00:00",..: 6523 6523 4242 11116 2224 2224 2260 383 3980 3980 ...
 $ PROPDMG   : num  25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...

Lets convert the “BGN_DATE” to readable date format from factor for calculation.

subStormData$BGN_DATE <- strptime(subStormData$BGN_DATE, format = "%m/%d/%Y 0:00:00")
str(subStormData$BGN_DATE)

 POSIXlt[1:902297], format: "1950-04-18" "1950-04-18" "1951-02-20" "1951-06-08" ...

subStormData$Year<-as.factor(year(subStormData$BGN_DATE))
str(subStormData)

'data.frame':   902297 obs. of  5 variables:
 $ EVTYPE    : Factor w/ 985 levels "   HIGH SURF ADVISORY",..: 834 834 834 834 834 834 834 834 834 834 ...
 $ FATALITIES: num  0 0 0 0 0 0 0 0 1 0 ...
 $ BGN_DATE  : POSIXlt, format: "1950-04-18" "1950-04-18" ...
 $ PROPDMG   : num  25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
 $ Year      : Factor w/ 62 levels "1950","1951",..: 1 1 2 2 2 2 2 3 3 3 ...

Results

Analysis

1. Across the United States, which types of events are most harmful with respect to population health?

severity <- as.data.frame(aggregate(subStormData$FATALITIES ~ subStormData$EVTYPE, FUN = sum))
colnames(severity) <- c("Event.Type","Fatalities")
severity<-severity[order(severity$Fatalities, decreasing=TRUE),]
rownames(severity)<-NULL

Graphical representation of Top 10 harmful events with respect to population health.

y_max<-max(severity$Fatalities)
ggplot(data = head(severity,10), aes(x = Event.Type, y = Fatalities, ymax=y_max+100))+
        geom_bar(aes(fill=Event.Type),  stat = "Identity",  position = position_dodge(width=0.9)) +
        xlab('Event Type') +  ylab("Fatalities") +theme(axis.text.x = element_text(angle = 90, hjust = 1))+
        labs(title="Impact on US Population by Natural Events")+
        geom_text(aes(label = Fatalities), size = 3, position = "stack", angel=90)

plot of chunk unnamed-chunk-9

Top 20 events that are most harmful with respect to population health are

severity$Fatalities<-formatC(severity$Fatalities, big.mark=",", format="d")
kable(head(severity,20), format = "pandoc", padding = 0,row.names = FALSE, caption="Nationawide Top 20 Harmful Events")



Table: Nationawide Top 20 Harmful Events

Event.Type               Fatalities
-----------------------  ----------
TORNADO                  5,633     
EXCESSIVE HEAT           1,903     
FLASH FLOOD              978       
HEAT                     937       
LIGHTNING                816       
TSTM WIND                504       
FLOOD                    470       
RIP CURRENT              368       
HIGH WIND                248       
AVALANCHE                224       
WINTER STORM             206       
RIP CURRENTS             204       
HEAT WAVE                172       
EXTREME COLD             160       
THUNDERSTORM WIND        133       
HEAVY SNOW               127       
EXTREME COLD/WIND CHILL  125       
STRONG WIND              103       
BLIZZARD                 101       
HIGH SURF                101

2. Across the United States, which types of events have the greatest economic consequences?

Summarizing property damages with respect to event type. “Economic.Loss”" is the total losses across US by event type.

ecDamage <- as.data.frame(aggregate(subStormData$PROPDMG ~ subStormData$EVTYPE, FUN = sum))
colnames(ecDamage) <- c("Event.Type","Economic.Loss")
ecDamage<-ecDamage[order(ecDamage$Economic.Loss, decreasing=TRUE),]
rownames(ecDamage)<-NULL
str(ecDamage)

'data.frame':   985 obs. of  2 variables:
 $ Event.Type   : Factor w/ 985 levels "   HIGH SURF ADVISORY",..: 834 153 856 170 760 244 464 786 359 972 ...
 $ Economic.Loss: num  3212258 1420125 1335966 899938 876844 ...

ecDamage$Eco_Loss_Thousand<-round(ecDamage$Economic.Loss/1000,2)

Due to big numbers in “Economic.Loss”, we would divide it by 1,000 to make more friendly representation and call that varibale as “Eco_Loss_Thousand”.

Graphical representation of Top 10 harmful events with respect to population health.

y_max<-max(ecDamage$Economic.Loss)
ggplot(data = head(ecDamage,10), aes(x = Event.Type, y = Eco_Loss_Thousand, ymax=y_max/1000+100))+
        geom_bar(aes(fill=Event.Type),  stat = "Identity",  position = position_dodge(width=0.9)) +
        xlab('Event Type') +  ylab("Economic Losses ( in thousands)") + 
        theme(axis.text.x = element_text(angle = 45, hjust = 1, size=8))+
        labs(title="US Property Losses ('000) by Top 10 Harmful Events")+
        geom_text(aes(label = Eco_Loss_Thousand), size = 3, position = "stack", angle=90)

plot of chunk unnamed-chunk-12

Top 20 harmful events that are causing most of the property damages across the US.

ecDamage$Eco_Loss_Thousand<-format(ecDamage$Eco_Loss_Thousand, big.mark=",", format="d")
ecDamage$Economic.Loss<-format(ecDamage$Economic.Loss, big.mark=",", format="d")
kable(head(ecDamage,20), format = "pandoc", padding = 0,row.names = FALSE, caption="Nationawide Top 20 Harmful Events")



Table: Nationawide Top 20 Harmful Events

Event.Type            Economic.Loss  Eco_Loss_Thousand
--------------------  -------------  -----------------
TORNADO               3,212,258.16   3,212.26         
FLASH FLOOD           1,420,124.59   1,420.12         
TSTM WIND             1,335,965.61   1,335.97         
FLOOD                   899,938.48     899.94         
THUNDERSTORM WIND       876,844.17     876.84         
HAIL                    688,693.38     688.69         
LIGHTNING               603,351.78     603.35         
THUNDERSTORM WINDS      446,293.18     446.29         
HIGH WIND               324,731.56     324.73         
WINTER STORM            132,720.59     132.72         
HEAVY SNOW              122,251.99     122.25         
WILDFIRE                 84,459.34      84.46         
ICE STORM                66,000.67      66.00         
STRONG WIND              62,993.81      62.99         
HIGH WINDS               55,625.00      55.62         
HEAVY RAIN               50,842.14      50.84         
TROPICAL STORM           48,423.68      48.42         
WILD/FOREST FIRE         39,344.95      39.34         
FLASH FLOODING           28,497.15      28.50         
URBAN/SML STREAM FLD     26,051.94      26.05

Yearly Trends

Yearly_fatalities<-aggregate(subStormData$FATALITIES, by = list(subStormData$Year), 
                              FUN = "sum", na.rm = TRUE)
Yearly_damage<-aggregate(subStormData$PROPDMG, by = list(subStormData$Year), 
                                 FUN = "sum", na.rm = TRUE)
colnames(Yearly_fatalities)<-c("Year","fatalities")
colnames(Yearly_damage)<-c("Year","damage")
Yearly_losses<-merge(Yearly_fatalities, Yearly_damage, by="Year")
rownames(Yearly_losses)<-NULL
last_10Yrs<-tail(Yearly_losses, 10)
last_10Yrs$damage_K<-round(last_10Yrs$damage/1000,2)
y_max_1<-max(last_10Yrs$fatalities)+100
y_max_2<-max(last_10Yrs$damage_K)+100

library(grid)
library(gridExtra)
p1 <- 
    ggplot(data = last_10Yrs, aes(x = Year, y = fatalities, ymax=y_max_1))+
        geom_bar(aes(fill=Year),  stat = "Identity",  position = position_dodge(width=0.9)) +
        xlab('Year') +  ylab("Fatalities") + 
        theme(axis.text.x = element_text(angle = 45, hjust = 1, size=8))+
        labs(title="Fatalities")+
        geom_text(aes(label = fatalities), size = 3, position = "stack", angle=90) 

p2 <- 
    ggplot(data = last_10Yrs, aes(x = Year, y = damage_K, ymax=y_max_2))+
        geom_bar(aes(fill=Year),  stat = "Identity",  position = position_dodge(width=0.9)) +
        xlab('Year') +  ylab("Property Damages") + 
        theme(axis.text.x = element_text(angle = 45, hjust = 1, size=8))+
        labs(title="Property Damages ('000)")+
        geom_text(aes(label = damage_K), size = 3, position = "stack", angle=90) 
grid.arrange(p1, p2, ncol = 2, main = "Distribution of Losses due to different Events in the US")

plot of chunk unnamed-chunk-15

Findings

We can observe from the analysis that “Tornado” is the most harmful event, which causing maximum losses of properties and harming population health. There are significant amount of fatalities happened due to Excessive Heat and property losses due to Flash Floods, Flood and TSTM Winds.

This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.

When you click the Knit button a document will be generated that includes both content as well as the output of any embedded R code chunks within the document. You can embed an R code chunk like this:

Note that the echo = TRUE parameter was added to the code chunk to prevent printing of the R code that generated the plot.

RepResearch2: U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database Analysis

Ariful Mondal

Wednesday, September 17, 2014