Storm Data is an official publication of the National Oceanic and Atmospheric Administration (NOAA) which documents the occurrence of storms and other significant weather phenomena having sufficient intensity to cause loss of life, injuries, significant property damage, and/or disruption of commerce. It also documents rare and unusual weather phenomena that generate media attention, as well as other significant meteorological events, such as maximum or minimum temperatures or precipitation, that occur in connection with another event. The events in the database start in the year 1950 and end in November 2011.
The purpose of this document is to provide an analysis of the NOAA Storm Database and to answer some basic questions about the impact of severe weather events.
First we download the storm data from the NOAA website. It is provided as a bzip2-compressed CSV file, which we will uncompress and load into a data frame.
#Load the libraries we will need for processing
library(dplyr)
library(ggplot2)
library(knitr)
#Download the file from the NOAA website
if(!file.exists("noaa.data.csv.bz2")) {
download.file("http://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2",
destfile="noaa.data.csv.bz2")
}
#Open the file and load the data into a data frame
if (file.exists("noaa.data.csv.bz2")) {
    noaa.data <- read.csv(bzfile("noaa.data.csv.bz2"), header = TRUE)
}
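As an aside, recent versions of base R usually decompress .bz2 files transparently when reading, so the explicit bzfile() connection is optional; a minimal alternative sketch, assuming the same file name, would be:
#Alternative: base R typically decompresses .bz2 files transparently
noaa.data <- read.csv("noaa.data.csv.bz2", header = TRUE)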
There are 902,297 observations of 37 variables.
str(noaa.data)
## 'data.frame': 902297 obs. of 37 variables:
## $ STATE__ : num 1 1 1 1 1 1 1 1 1 1 ...
## $ BGN_DATE : Factor w/ 16335 levels "1/1/1966 0:00:00",..: 6523 6523 4242 11116 2224 2224 2260 383 3980 3980 ...
## $ BGN_TIME : Factor w/ 3608 levels "00:00:00 AM",..: 272 287 2705 1683 2584 3186 242 1683 3186 3186 ...
## $ TIME_ZONE : Factor w/ 22 levels "ADT","AKS","AST",..: 7 7 7 7 7 7 7 7 7 7 ...
## $ COUNTY : num 97 3 57 89 43 77 9 123 125 57 ...
## $ COUNTYNAME: Factor w/ 29601 levels "","5NM E OF MACKINAC BRIDGE TO PRESQUE ISLE LT MI",..: 13513 1873 4598 10592 4372 10094 1973 23873 24418 4598 ...
## $ STATE : Factor w/ 72 levels "AK","AL","AM",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ EVTYPE : Factor w/ 985 levels " HIGH SURF ADVISORY",..: 834 834 834 834 834 834 834 834 834 834 ...
## $ BGN_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ BGN_AZI : Factor w/ 35 levels ""," N"," NW",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ BGN_LOCATI: Factor w/ 54429 levels "","- 1 N Albion",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ END_DATE : Factor w/ 6663 levels "","1/1/1993 0:00:00",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ END_TIME : Factor w/ 3647 levels ""," 0900CST",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ COUNTY_END: num 0 0 0 0 0 0 0 0 0 0 ...
## $ COUNTYENDN: logi NA NA NA NA NA NA ...
## $ END_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ END_AZI : Factor w/ 24 levels "","E","ENE","ESE",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ END_LOCATI: Factor w/ 34506 levels "","- .5 NNW",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ LENGTH : num 14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
## $ WIDTH : num 100 150 123 100 150 177 33 33 100 100 ...
## $ F : int 3 2 2 2 2 2 2 1 3 3 ...
## $ MAG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ FATALITIES: num 0 0 0 0 0 0 0 0 1 0 ...
## $ INJURIES : num 15 0 2 2 2 6 1 0 14 0 ...
## $ PROPDMG : num 25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
## $ PROPDMGEXP: Factor w/ 19 levels "","-","?","+",..: 17 17 17 17 17 17 17 17 17 17 ...
## $ CROPDMG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CROPDMGEXP: Factor w/ 9 levels "","?","0","2",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ WFO : Factor w/ 542 levels ""," CI","$AC",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ STATEOFFIC: Factor w/ 250 levels "","ALABAMA, Central",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ ZONENAMES : Factor w/ 25112 levels ""," "| __truncated__,..: 1 1 1 1 1 1 1 1 1 1 ...
## $ LATITUDE : num 3040 3042 3340 3458 3412 ...
## $ LONGITUDE : num 8812 8755 8742 8626 8642 ...
## $ LATITUDE_E: num 3051 0 0 0 0 ...
## $ LONGITUDE_: num 8806 0 0 0 0 ...
## $ REMARKS : Factor w/ 436781 levels "","-2 at Deer Park\n",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ REFNUM : num 1 2 3 4 5 6 7 8 9 10 ...
Since we do not need all 37 variables for our analysis, we prepare a data frame containing only the columns we will use.
noaa.analysis <- select(noaa.data,
EventType = EVTYPE,
Fatalities = FATALITIES,
Injuries = INJURIES,
PropertyDamage = PROPDMG,
PropertyDamageMagnitude = PROPDMGEXP,
CropDamage = CROPDMG,
CropDamageMagnitude = CROPDMGEXP)
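An optional quick check confirms the dimensions of the reduced data frame and the renamed columns:
#Optional sanity check on the reduced data frame
dim(noaa.analysis)
names(noaa.analysis)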
We end up with a data frame with the following columns:
EventType: the event type (tornado, hurricane, flood, etc.).
Fatalities: the number of human deaths caused by the event.
Injuries: the number of human injuries caused by the event.
PropertyDamage: the property damage caused by the event, in US Dollars.
PropertyDamageMagnitude: the magnitude of the property damage (thousands, millions, etc.).
CropDamage: the crop damage caused by the event, in US Dollars.
CropDamageMagnitude: the magnitude of the crop damage (thousands, millions, etc.).
The PropertyDamageMagnitude and CropDamageMagnitude columns indicate the magnitude of the PropertyDamage and CropDamage columns respectively. We will need to compute two additional columns to correctly express the property and crop damage values in US Dollars.
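Before converting, it helps to tabulate the raw magnitude codes; as the structure output above shows, these columns contain blanks, digits, and symbols in addition to the letter codes. An optional exploratory check:
#Inspect the raw magnitude codes present in the data
table(noaa.analysis$PropertyDamageMagnitude)
table(noaa.analysis$CropDamageMagnitude)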
For the purposes of this analysis I am treating the PropertyDamageMagnitude and CropDamageMagnitude codes as follows: H for hundreds, K for thousands, M for millions, and B for billions; any other value is treated as a multiplier of one. The corresponding columns are created with a function that performs the appropriate calculations.
PopulateDamage <- function(damage.value, magnitude) {
    #Normalize the magnitude code and map it to a power of ten
    magnitude <- toupper(magnitude)
    factor.value <- ifelse(magnitude=='B', 9,
                    ifelse(magnitude=='M', 6,
                    ifelse(magnitude=='K', 3,
                    ifelse(magnitude=='H', 2, 0))))
    #Multiply the reported value by the corresponding power of ten
    damage.value * (10^factor.value)
}
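A quick illustrative spot check of the conversion (2.5 with magnitude "K" should yield 2500, and 1 with magnitude "B" should yield 1e+09):
#Spot-check the magnitude conversion
PopulateDamage(2.5, "K")
PopulateDamage(1, "B")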
Now, with this function in place, we can add the new columns containing the calculated damage amounts.
noaa.complete <- mutate(noaa.analysis,
PropertyDamageTotal = PopulateDamage(PropertyDamage, PropertyDamageMagnitude),
CropDamageTotal = PopulateDamage(CropDamage,CropDamageMagnitude))
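Optionally, we can glance at the first few rows of the original and calculated property damage columns to verify the conversion:
#Verify the calculated property damage for the first few rows
head(select(noaa.complete, PropertyDamage, PropertyDamageMagnitude, PropertyDamageTotal))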
We now have the data in the format we need for our analysis.
In order to determine the events that caused the most harm to population health, we select the events with the highest combined number of fatalities and injuries.
#Group by event type and rank by combined fatalities and injuries
by.event <- group_by(noaa.complete, EventType)
health.harm.top10 <- summarize(by.event,
                               TotalFatalities = sum(Fatalities),
                               TotalInjuries = sum(Injuries),
                               TotalHarm = sum(Fatalities) + sum(Injuries)) %>%
    arrange(desc(TotalHarm)) %>%
    top_n(10)   #top_n selects by the last column created, TotalHarm
These are the events that cause the most harm to population health:
kable(health.harm.top10, caption = "Most Harmful Events to Population Health")
| EventType | TotalFatalities | TotalInjuries | TotalHarm |
|---|---|---|---|
| TORNADO | 5633 | 91346 | 96979 |
| EXCESSIVE HEAT | 1903 | 6525 | 8428 |
| TSTM WIND | 504 | 6957 | 7461 |
| FLOOD | 470 | 6789 | 7259 |
| LIGHTNING | 816 | 5230 | 6046 |
| HEAT | 937 | 2100 | 3037 |
| FLASH FLOOD | 978 | 1777 | 2755 |
| ICE STORM | 89 | 1975 | 2064 |
| THUNDERSTORM WIND | 133 | 1488 | 1621 |
| WINTER STORM | 206 | 1321 | 1527 |
ggplot(health.harm.top10, aes(x=EventType, y=TotalHarm, fill=EventType))+
geom_bar(colour="black", stat="identity") + coord_flip() +
labs(title="Most Harmful Events to Population Health", x="Event", y="Total Harm") +
theme(title=element_text(size=18,face="bold"),
axis.text=element_text(size=12),
axis.title=element_text(size=14,face="bold"))
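One possible refinement, not applied here, is to order the bars by TotalHarm so the ranking reads more easily; a sketch using reorder():
#Sketch: reorder the bars by total harm
ggplot(health.harm.top10, aes(x=reorder(EventType, TotalHarm), y=TotalHarm, fill=EventType)) +
    geom_bar(colour="black", stat="identity") + coord_flip() +
    labs(title="Most Harmful Events to Population Health", x="Event", y="Total Harm")
Next, to determine the events with the greatest economic consequences, we sum the property and crop damage for each event type and convert the totals to billions of US Dollars.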
#Group by event type and rank by total economic damage, in billions of US Dollars
most.damage <- group_by(noaa.complete, EventType) %>%
    summarize(TotalEconomicDamage = sum(PropertyDamageTotal + CropDamageTotal) / 1000000000) %>%
    arrange(desc(TotalEconomicDamage)) %>%
    top_n(10)
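As with the health impacts, kable can display this ranking as a table:
#Tabulate the top 10 events by total economic damage
kable(most.damage, caption = "Events with the Greatest Economic Consequences")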
These are the events with the greatest economic consequences, expressed in billions of US Dollars:
ggplot(most.damage, aes(x=EventType, y=TotalEconomicDamage, fill=EventType))+
geom_bar(colour="black", stat="identity") + coord_flip() +
labs( x="Event", y="Total damage") +
ggtitle(expression(atop("Events with the Greatest Economic Consequences",
atop("In billions of US Dollars", "")))) +
theme(title=element_text(size=18,face="bold"),
axis.text=element_text(size=12),
axis.title=element_text(size=14,face="bold"))
Tornadoes have caused the most harm to population health, while floods have caused the greatest economic damage.