Storms and other severe weather events can cause both public health and economic problems for communities and municipalities. Many severe events can result in fatalities, injuries, and property damage, and preventing such outcomes to the extent possible is a key concern.
This project involves exploring the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. This database tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage.
The basic goal of this exercise is to explore the NOAA Storm Database and answer some basic questions about severe weather events. We would primarily address following two questions through some basic analysis.
Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?
Across the United States, which types of events have the greatest economic consequences?
The data for this assignment come in the form of a comma-separated-value file compressed via the bzip2 algorithm to reduce its size. You can download the file from the course web site:
Storm Data [47Mb] There is also some documentation of the database available. Here you will find how some of the variables are constructed/defined.
National Weather Service Storm Data Documentation
National Climatic Data Center Storm Events FAQ
The events in the database start in the year 1950 and end in November 2011. In the earlier years of the database there are generally fewer events recorded, most likely due to a lack of good records. More recent years should be considered more complete.
library(data.table)
library(xtable)
library(ggplot2)
library(lattice)
library(plyr)
library(knitr)
library(lubridate)
Attaching package: 'lubridate'
The following object is masked from 'package:plyr':
here
The following objects are masked from 'package:data.table':
hour, mday, month, quarter, wday, week, yday, year
options(scipen=999)
Download the storm data from web using the script download.file("https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2","StormData.csv.bz2") or manually and save the file as “StormData.csv.bz2” in your present working directory of current R-session. Here, we would download the file using script only.
library(RCurl)
file<-"StormData.csv.bz2"
url<-"http://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
download.file(url, file, mode = "wb")
In R reading in a .bz2 file can be done using read.table() command as we have done here to read “StormData.csv.bz2”. Sometime we may have to ise bzfile("file.bz2", "rt") to uncompress file and then read.
stormData<-read.table(file, sep=",", header=TRUE)
We have used echo=TRUE to print the code chunk in the report and cache=TRUE with the r-chunk to reduce re-production time of large data processing when the processing being done already, no need to run the code again.
It is always good to know some basics about the contents of the data using dim(),head(), tail(), summary() or str() like functions before proceeding further.
str(stormData)
'data.frame': 902297 obs. of 37 variables:
$ STATE__ : num 1 1 1 1 1 1 1 1 1 1 ...
$ BGN_DATE : Factor w/ 16335 levels "1/1/1966 0:00:00",..: 6523 6523 4242 11116 2224 2224 2260 383 3980 3980 ...
$ BGN_TIME : Factor w/ 3608 levels "00:00:00 AM",..: 272 287 2705 1683 2584 3186 242 1683 3186 3186 ...
$ TIME_ZONE : Factor w/ 22 levels "ADT","AKS","AST",..: 7 7 7 7 7 7 7 7 7 7 ...
$ COUNTY : num 97 3 57 89 43 77 9 123 125 57 ...
$ COUNTYNAME: Factor w/ 29601 levels "","5NM E OF MACKINAC BRIDGE TO PRESQUE ISLE LT MI",..: 13513 1873 4598 10592 4372 10094 1973 23873 24418 4598 ...
$ STATE : Factor w/ 72 levels "AK","AL","AM",..: 2 2 2 2 2 2 2 2 2 2 ...
$ EVTYPE : Factor w/ 985 levels " HIGH SURF ADVISORY",..: 834 834 834 834 834 834 834 834 834 834 ...
$ BGN_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
$ BGN_AZI : Factor w/ 35 levels ""," N"," NW",..: 1 1 1 1 1 1 1 1 1 1 ...
$ BGN_LOCATI: Factor w/ 54429 levels "","- 1 N Albion",..: 1 1 1 1 1 1 1 1 1 1 ...
$ END_DATE : Factor w/ 6663 levels "","1/1/1993 0:00:00",..: 1 1 1 1 1 1 1 1 1 1 ...
$ END_TIME : Factor w/ 3647 levels ""," 0900CST",..: 1 1 1 1 1 1 1 1 1 1 ...
$ COUNTY_END: num 0 0 0 0 0 0 0 0 0 0 ...
$ COUNTYENDN: logi NA NA NA NA NA NA ...
$ END_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
$ END_AZI : Factor w/ 24 levels "","E","ENE","ESE",..: 1 1 1 1 1 1 1 1 1 1 ...
$ END_LOCATI: Factor w/ 34506 levels "","- .5 NNW",..: 1 1 1 1 1 1 1 1 1 1 ...
$ LENGTH : num 14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
$ WIDTH : num 100 150 123 100 150 177 33 33 100 100 ...
$ F : int 3 2 2 2 2 2 2 1 3 3 ...
$ MAG : num 0 0 0 0 0 0 0 0 0 0 ...
$ FATALITIES: num 0 0 0 0 0 0 0 0 1 0 ...
$ INJURIES : num 15 0 2 2 2 6 1 0 14 0 ...
$ PROPDMG : num 25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
$ PROPDMGEXP: Factor w/ 19 levels "","-","?","+",..: 17 17 17 17 17 17 17 17 17 17 ...
$ CROPDMG : num 0 0 0 0 0 0 0 0 0 0 ...
$ CROPDMGEXP: Factor w/ 9 levels "","?","0","2",..: 1 1 1 1 1 1 1 1 1 1 ...
$ WFO : Factor w/ 542 levels ""," CI","$AC",..: 1 1 1 1 1 1 1 1 1 1 ...
$ STATEOFFIC: Factor w/ 250 levels "","ALABAMA, Central",..: 1 1 1 1 1 1 1 1 1 1 ...
$ ZONENAMES : Factor w/ 25112 levels ""," "| __truncated__,..: 1 1 1 1 1 1 1 1 1 1 ...
$ LATITUDE : num 3040 3042 3340 3458 3412 ...
$ LONGITUDE : num 8812 8755 8742 8626 8642 ...
$ LATITUDE_E: num 3051 0 0 0 0 ...
$ LONGITUDE_: num 8806 0 0 0 0 ...
$ REMARKS : Factor w/ 436781 levels "","-2 at Deer Park\n",..: 1 1 1 1 1 1 1 1 1 1 ...
$ REFNUM : num 1 2 3 4 5 6 7 8 9 10 ...
There are 902,297 observations and 37 columns in the Storm Data. str() also produces basic summaries of each columns (variables) in the data along with type of variables like “numeric”, integer“,”factor“,”character“,”date" etc. so that we can make right transformation of variables when needed.
For this analysis we would require only fewer variables/columns from “stormData” to analyze severe weather events viz. EVTYPE - Type of events, FATALITIES - Fatalities, BGN_DATE - Begining date, PROPDMG-Property Damage. We will subset the data for faster processing.
subStormData<-stormData[,c("EVTYPE", "FATALITIES", "BGN_DATE", "PROPDMG")]
str(subStormData)
'data.frame': 902297 obs. of 4 variables:
$ EVTYPE : Factor w/ 985 levels " HIGH SURF ADVISORY",..: 834 834 834 834 834 834 834 834 834 834 ...
$ FATALITIES: num 0 0 0 0 0 0 0 0 1 0 ...
$ BGN_DATE : Factor w/ 16335 levels "1/1/1966 0:00:00",..: 6523 6523 4242 11116 2224 2224 2260 383 3980 3980 ...
$ PROPDMG : num 25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
Lets convert the “BGN_DATE” to readable date format from factor for calculation.
subStormData$BGN_DATE <- strptime(subStormData$BGN_DATE, format = "%m/%d/%Y 0:00:00")
str(subStormData$BGN_DATE)
POSIXlt[1:902297], format: "1950-04-18" "1950-04-18" "1951-02-20" "1951-06-08" ...
subStormData$Year<-as.factor(year(subStormData$BGN_DATE))
str(subStormData)
'data.frame': 902297 obs. of 5 variables:
$ EVTYPE : Factor w/ 985 levels " HIGH SURF ADVISORY",..: 834 834 834 834 834 834 834 834 834 834 ...
$ FATALITIES: num 0 0 0 0 0 0 0 0 1 0 ...
$ BGN_DATE : POSIXlt, format: "1950-04-18" "1950-04-18" ...
$ PROPDMG : num 25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
$ Year : Factor w/ 62 levels "1950","1951",..: 1 1 2 2 2 2 2 3 3 3 ...
severity <- as.data.frame(aggregate(subStormData$FATALITIES ~ subStormData$EVTYPE, FUN = sum))
colnames(severity) <- c("Event.Type","Fatalities")
severity<-severity[order(severity$Fatalities, decreasing=TRUE),]
rownames(severity)<-NULL
Graphical representation of Top 10 harmful events with respect to population health.
y_max<-max(severity$Fatalities)
ggplot(data = head(severity,10), aes(x = Event.Type, y = Fatalities, ymax=y_max+100))+
geom_bar(aes(fill=Event.Type), stat = "Identity", position = position_dodge(width=0.9)) +
xlab('Event Type') + ylab("Fatalities") +theme(axis.text.x = element_text(angle = 90, hjust = 1))+
labs(title="Impact on US Population by Natural Events")+
geom_text(aes(label = Fatalities), size = 3, position = "stack", angel=90)
Top 20 events that are most harmful with respect to population health are
severity$Fatalities<-formatC(severity$Fatalities, big.mark=",", format="d")
kable(head(severity,20), format = "pandoc", padding = 0,row.names = FALSE, caption="Nationawide Top 20 Harmful Events")
Table: Nationawide Top 20 Harmful Events
Event.Type Fatalities
----------------------- ----------
TORNADO 5,633
EXCESSIVE HEAT 1,903
FLASH FLOOD 978
HEAT 937
LIGHTNING 816
TSTM WIND 504
FLOOD 470
RIP CURRENT 368
HIGH WIND 248
AVALANCHE 224
WINTER STORM 206
RIP CURRENTS 204
HEAT WAVE 172
EXTREME COLD 160
THUNDERSTORM WIND 133
HEAVY SNOW 127
EXTREME COLD/WIND CHILL 125
STRONG WIND 103
BLIZZARD 101
HIGH SURF 101
Summarizing property damages with respect to event type. “Economic.Loss”" is the total losses across US by event type.
ecDamage <- as.data.frame(aggregate(subStormData$PROPDMG ~ subStormData$EVTYPE, FUN = sum))
colnames(ecDamage) <- c("Event.Type","Economic.Loss")
ecDamage<-ecDamage[order(ecDamage$Economic.Loss, decreasing=TRUE),]
rownames(ecDamage)<-NULL
str(ecDamage)
'data.frame': 985 obs. of 2 variables:
$ Event.Type : Factor w/ 985 levels " HIGH SURF ADVISORY",..: 834 153 856 170 760 244 464 786 359 972 ...
$ Economic.Loss: num 3212258 1420125 1335966 899938 876844 ...
ecDamage$Eco_Loss_Thousand<-round(ecDamage$Economic.Loss/1000,2)
Due to big numbers in “Economic.Loss”, we would divide it by 1,000 to make more friendly representation and call that varibale as “Eco_Loss_Thousand”.
Graphical representation of Top 10 harmful events with respect to population health.
y_max<-max(ecDamage$Economic.Loss)
ggplot(data = head(ecDamage,10), aes(x = Event.Type, y = Eco_Loss_Thousand, ymax=y_max/1000+100))+
geom_bar(aes(fill=Event.Type), stat = "Identity", position = position_dodge(width=0.9)) +
xlab('Event Type') + ylab("Economic Losses ( in thousands)") +
theme(axis.text.x = element_text(angle = 45, hjust = 1, size=8))+
labs(title="US Property Losses ('000) by Top 10 Harmful Events")+
geom_text(aes(label = Eco_Loss_Thousand), size = 3, position = "stack", angle=90)
Top 20 harmful events that are causing most of the property damages across the US.
ecDamage$Eco_Loss_Thousand<-format(ecDamage$Eco_Loss_Thousand, big.mark=",", format="d")
ecDamage$Economic.Loss<-format(ecDamage$Economic.Loss, big.mark=",", format="d")
kable(head(ecDamage,20), format = "pandoc", padding = 0,row.names = FALSE, caption="Nationawide Top 20 Harmful Events")
Table: Nationawide Top 20 Harmful Events
Event.Type Economic.Loss Eco_Loss_Thousand
-------------------- ------------- -----------------
TORNADO 3,212,258.16 3,212.26
FLASH FLOOD 1,420,124.59 1,420.12
TSTM WIND 1,335,965.61 1,335.97
FLOOD 899,938.48 899.94
THUNDERSTORM WIND 876,844.17 876.84
HAIL 688,693.38 688.69
LIGHTNING 603,351.78 603.35
THUNDERSTORM WINDS 446,293.18 446.29
HIGH WIND 324,731.56 324.73
WINTER STORM 132,720.59 132.72
HEAVY SNOW 122,251.99 122.25
WILDFIRE 84,459.34 84.46
ICE STORM 66,000.67 66.00
STRONG WIND 62,993.81 62.99
HIGH WINDS 55,625.00 55.62
HEAVY RAIN 50,842.14 50.84
TROPICAL STORM 48,423.68 48.42
WILD/FOREST FIRE 39,344.95 39.34
FLASH FLOODING 28,497.15 28.50
URBAN/SML STREAM FLD 26,051.94 26.05
Yearly_fatalities<-aggregate(subStormData$FATALITIES, by = list(subStormData$Year),
FUN = "sum", na.rm = TRUE)
Yearly_damage<-aggregate(subStormData$PROPDMG, by = list(subStormData$Year),
FUN = "sum", na.rm = TRUE)
colnames(Yearly_fatalities)<-c("Year","fatalities")
colnames(Yearly_damage)<-c("Year","damage")
Yearly_losses<-merge(Yearly_fatalities, Yearly_damage, by="Year")
rownames(Yearly_losses)<-NULL
last_10Yrs<-tail(Yearly_losses, 10)
last_10Yrs$damage_K<-round(last_10Yrs$damage/1000,2)
y_max_1<-max(last_10Yrs$fatalities)+100
y_max_2<-max(last_10Yrs$damage_K)+100
library(grid)
library(gridExtra)
p1 <-
ggplot(data = last_10Yrs, aes(x = Year, y = fatalities, ymax=y_max_1))+
geom_bar(aes(fill=Year), stat = "Identity", position = position_dodge(width=0.9)) +
xlab('Year') + ylab("Fatalities") +
theme(axis.text.x = element_text(angle = 45, hjust = 1, size=8))+
labs(title="Fatalities")+
geom_text(aes(label = fatalities), size = 3, position = "stack", angle=90)
p2 <-
ggplot(data = last_10Yrs, aes(x = Year, y = damage_K, ymax=y_max_2))+
geom_bar(aes(fill=Year), stat = "Identity", position = position_dodge(width=0.9)) +
xlab('Year') + ylab("Property Damages") +
theme(axis.text.x = element_text(angle = 45, hjust = 1, size=8))+
labs(title="Property Damages ('000)")+
geom_text(aes(label = damage_K), size = 3, position = "stack", angle=90)
grid.arrange(p1, p2, ncol = 2, main = "Distribution of Losses due to different Events in the US")
We can observe from the analysis that “Tornado” is the most harmful event, which causing maximum losses of properties and harming population health. There are significant amount of fatalities happened due to Excessive Heat and property losses due to Flash Floods, Flood and TSTM Winds.
This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.
When you click the Knit button a document will be generated that includes both content as well as the output of any embedded R code chunks within the document. You can embed an R code chunk like this:
Note that the echo = TRUE parameter was added to the code chunk to prevent printing of the R code that generated the plot.