This analysis seeks to answer two specific questions about the impact of storms and other extreme weather events on the economy and public health in the US. It uses the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database, which tracks characteristics of major storms and weather events in the United States. The events in the database start in 1950 and end in November 2011. In the earlier years of the database there are generally fewer events recorded, most likely due to a lack of good records.
The data are downloaded from the web into the working directory. They come as a .csv file compressed with bzip2, so we decompress the file after downloading and load it into R.
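library(R.utils)  # provides bunzip2()
# Download the compressed file only if it is not already present
# (mode="wb" keeps the binary archive intact on Windows)
if(!file.exists("repSTData.csv.bz2"))
download.file("http://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2",
destfile="repSTData.csv.bz2", mode="wb")
# Decompress once, keeping the original archive
if(!file.exists("repSTData.csv"))
bunzip2("repSTData.csv.bz2", overwrite=FALSE, remove=FALSE)
STData = read.csv("repSTData.csv")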
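As a side note on the decompression step, base R can usually read a bzip2-compressed CSV directly, because `read.csv()` detects the compression when it opens the file. The sketch below illustrates this on a small made-up data frame (`toy` and the temporary file are assumptions for illustration, not the storm data), so the separate decompression step is a choice rather than a requirement.

```r
# Hypothetical toy data standing in for the storm file
toy <- data.frame(EVTYPE = c("TORNADO", "FLOOD"), FATALITIES = c(1, 0))

# Write it as a bzip2-compressed CSV in a temporary location
path <- tempfile(fileext = ".csv.bz2")
con <- bzfile(path, open = "w")
write.csv(toy, con, row.names = FALSE)
close(con)

# read.csv() opens the compressed file transparently
toy_back <- read.csv(path, stringsAsFactors = FALSE)
print(toy_back)
```

Reading the archive directly saves disk space, while decompressing once (as done here) can make repeated loads of a large file faster.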
Taking a look at the data:
head(STData, 3)
## STATE__ BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME STATE
## 1 1 4/18/1950 0:00:00 0130 CST 97 MOBILE AL
## 2 1 4/18/1950 0:00:00 0145 CST 3 BALDWIN AL
## 3 1 2/20/1951 0:00:00 1600 CST 57 FAYETTE AL
## EVTYPE BGN_RANGE BGN_AZI BGN_LOCATI END_DATE END_TIME COUNTY_END
## 1 TORNADO 0 0
## 2 TORNADO 0 0
## 3 TORNADO 0 0
## COUNTYENDN END_RANGE END_AZI END_LOCATI LENGTH WIDTH F MAG FATALITIES
## 1 NA 0 14.0 100 3 0 0
## 2 NA 0 2.0 150 2 0 0
## 3 NA 0 0.1 123 2 0 0
## INJURIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP WFO STATEOFFIC ZONENAMES
## 1 15 25.0 K 0
## 2 0 2.5 K 0
## 3 2 25.0 K 0
## LATITUDE LONGITUDE LATITUDE_E LONGITUDE_ REMARKS REFNUM
## 1 3040 8812 3051 8806 1
## 2 3042 8755 0 0 2
## 3 3340 8742 0 0 3
Based on the available documentation for the database and the questions to answer, we select only the few variables that we believe are relevant for the analysis: EVTYPE, FATALITIES, INJURIES, PROPDMG, PROPDMGEXP, CROPDMG and CROPDMGEXP.
Now extract these variables and label them:
labels = c("tevent", "fatalities",
"injuries", "apdomagnitude", "ompdamage",
"acdomagnitude", "omcdamage")
ImpData = STData[,c("EVTYPE", "FATALITIES", "INJURIES", "PROPDMG", "PROPDMGEXP", "CROPDMG", "CROPDMGEXP")]
names(ImpData) = labels
The structure of the data is checked again:
str(ImpData)
## 'data.frame': 902297 obs. of 7 variables:
## $ tevent : Factor w/ 985 levels "?","ABNORMALLY DRY",..: 830 830 830 830 830 830 830 830 830 830 ...
## $ fatalities : num 0 0 0 0 0 0 0 0 1 0 ...
## $ injuries : num 15 0 2 2 2 6 1 0 14 0 ...
## $ apdomagnitude: num 25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
## $ ompdamage : Factor w/ 19 levels "","-","?","+",..: 17 17 17 17 17 17 17 17 17 17 ...
## $ acdomagnitude: num 0 0 0 0 0 0 0 0 0 0 ...
## $ omcdamage : Factor w/ 9 levels "","?","0","2",..: 1 1 1 1 1 1 1 1 1 1 ...
Of the values found in the order-of-magnitude columns for property damage and crop damage, only “K”, “M” and “B” will be used.
unique(ImpData$ompdamage)
## [1] K M B m + 0 5 6 ? 4 2 3 h 7 H - 1 8
## Levels: - ? + 0 1 2 3 4 5 6 7 8 B h H K m M
unique(ImpData$omcdamage)
## [1] M K m B ? 0 k 2
## Levels: ? 0 2 B k K m M
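To judge how much data each code covers, it can help to count occurrences instead of only listing the distinct values. The sketch below does this with `table()` on a small made-up vector (`codes` is hypothetical and does not reproduce the real frequencies in the database).

```r
# Hypothetical sample of order-of-magnitude codes
codes <- c("K", "K", "M", "", "?", "k", "B", "K")

# Count each distinct code and list the most frequent first
counts <- sort(table(codes), decreasing = TRUE)
print(counts)
```

On the real columns the same call would be `sort(table(ImpData$ompdamage), decreasing = TRUE)`.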
Known values are then replaced by their numerical value (K=1000, M=1000000, B=1000000000). Unknown values are replaced with zeros.
library(plyr)
ImpData$ompdamage = as.character(ImpData$ompdamage)
ImpData$ompdamage = gsub("\\-|\\+|\\?|h|H|0","0",ImpData$ompdamage)
ImpData$ompdamage = gsub("k|K", "1000", ImpData$ompdamage)
ImpData$ompdamage = gsub("m|M", "1000000", ImpData$ompdamage)
ImpData$ompdamage = gsub("b|B", "1000000000", ImpData$ompdamage)
ImpData$ompdamage = as.numeric(ImpData$ompdamage)
ImpData$ompdamage[is.na(ImpData$ompdamage)] = 0
ImpData$omcdamage = as.character(ImpData$omcdamage)
ImpData$omcdamage = gsub("\\-|\\+|\\?|h|H|0","0",ImpData$omcdamage)
ImpData$omcdamage = gsub("k|K", "1000", ImpData$omcdamage)
ImpData$omcdamage = gsub("m|M", "1000000", ImpData$omcdamage)
ImpData$omcdamage = gsub("b|B", "1000000000", ImpData$omcdamage)
ImpData$omcdamage = as.numeric(ImpData$omcdamage)
ImpData$omcdamage[is.na(ImpData$omcdamage)] = 0
ImpData = mutate(ImpData, Property = apdomagnitude * ompdamage, Crops = acdomagnitude * omcdamage)
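The same mapping can also be sketched with a named lookup vector, which states the rule "K, M and B get a multiplier, everything else is zero" in one place. This is an alternative illustration, not the code used for the results; `exp_mult` and `exp_to_multiplier` are hypothetical names. One difference is worth noting: in the `gsub()` version, digit codes such as "5" survive `as.numeric()` and act as multipliers of 5, whereas this sketch treats them as unknown.

```r
# Named lookup table: only the documented codes get a multiplier
exp_mult <- c(K = 1e3, M = 1e6, B = 1e9)

# Hypothetical helper (not part of the original analysis)
exp_to_multiplier <- function(code) {
  # Upper-case the code so "k" and "K" are treated alike
  mult <- exp_mult[toupper(as.character(code))]
  # Blanks, symbols and digits have no entry and are treated as unknown
  mult[is.na(mult)] <- 0
  unname(mult)
}

codes <- c("K", "m", "B", "", "?", "5", "h")
print(exp_to_multiplier(codes))
```

With this helper, the damage columns would be computed as `apdomagnitude * exp_to_multiplier(ompdamage)` and likewise for crops.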
In this section we try to answer the proposed questions based on a graphical analysis of the data.
The following graph summarizes the most extreme weather events in the US in terms of injuries and fatalities.
library(dplyr)
library(plyr)
library(reshape2)
library(ggplot2)
ImpDataS = ddply(ImpData, "tevent", summarise, tot.injuries = sum(injuries), tot.fatalities = sum(fatalities))
ImpDataS = arrange(ImpDataS, desc(tot.injuries),desc(tot.fatalities))
ImpDataS = head(ImpDataS, 10)
ImpDataS = melt(ImpDataS, id.vars = "tevent")
p = ggplot(ImpDataS, aes(x = tevent, y = value, fill= variable))
p + geom_bar(stat="identity") + coord_flip() + scale_fill_brewer() + labs(x="Event Type", y="Fatalities and Injuries", title="Top 10 most dangerous weather events in the US")
It is noteworthy that the weather events that cause the most fatalities and injuries in the United States are tornadoes and excessive heat events.
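The aggregation behind this ranking (sum per event type, sort by injuries and then fatalities, keep the top rows) can also be sketched in base R without plyr. The data frame `toy` below is made up for illustration, and `aggregate()` is swapped in for `ddply()`.

```r
# Hypothetical miniature of the ImpData columns used for the health ranking
toy <- data.frame(
  tevent     = c("TORNADO", "HEAT", "TORNADO", "FLOOD", "HEAT"),
  injuries   = c(15, 3, 2, 1, 0),
  fatalities = c(0, 4, 1, 0, 2)
)

# Sum injuries and fatalities per event type
totals <- aggregate(cbind(injuries, fatalities) ~ tevent, data = toy, FUN = sum)

# Sort by injuries, then fatalities, both descending, and keep the top 2
totals <- totals[order(-totals$injuries, -totals$fatalities), ]
top <- head(totals, 2)
print(top)
```

Because the sort key is injuries first, an event with many fatalities but few injuries can rank low; sorting on the sum of both columns is another reasonable choice.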
The following graph summarizes the most extreme weather events in the US in terms of damage to crops and property.
library(dplyr)
library(plyr)
library(reshape2)
library(ggplot2)
ImpDataS = ddply(ImpData, "tevent", summarise, tot.property = sum(Property), tot.crops = sum(Crops))
ImpDataS = arrange(ImpDataS, desc(tot.property),desc(tot.crops))
ImpDataS = head(ImpDataS, 10)
ImpDataS = melt(ImpDataS, id.vars = "tevent")
p = ggplot(ImpDataS, aes(x = tevent, y = value, fill= variable))
p + geom_bar(stat="identity") + coord_flip() + scale_fill_brewer() + labs(x="Event Type", y="Damages (US dollars)", title="Top 10 weather events that cause the most economic damages in the US")
It is noteworthy that the meteorological events that cause the most economic damage in the United States are floods, followed by hurricanes/typhoons and, to a lesser extent, tornadoes.
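Since the Property and Crops columns hold values in dollars, a ranking by combined damage expressed in billions requires dividing by 1e9 first. A minimal sketch on made-up numbers follows (`toy` and `total_billion` are hypothetical and do not reproduce the figures from the database).

```r
# Hypothetical damage totals in US dollars per event type
toy <- data.frame(
  tevent   = c("HAIL", "FLOOD", "TORNADO"),
  Property = c(1.0e10, 1.5e11, 5.0e10),
  Crops    = c(3.0e9, 5.0e9, 1.0e9)
)

# Combine property and crop damage and express the total in billions of dollars
toy$total_billion <- (toy$Property + toy$Crops) / 1e9

# Rank event types by combined damage, largest first
toy <- toy[order(toy$total_billion, decreasing = TRUE), ]
print(toy[, c("tevent", "total_billion")])
```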