Storms and other severe weather events can cause both public health and economic problems for communities and municipalities. Many severe events can result in fatalities, injuries, and property damage, and preventing such outcomes to the extent possible is a key concern.
This report explores the NOAA Storm Database and aims to answer two basic questions about the severe weather events that occurred across the United States between 1950 and 2011: which types of events are most harmful to population health, and which have the greatest economic consequences?
The storm event data is provided by Coursera for download. It contains the data collected for events that occurred from 1950 to the end of November 2011. The data originates from the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database.
We first read the storm event data from the bzip2-compressed CSV file.
data <- read.csv("repdata%2Fdata%2FStormData.csv.bz2")
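No manual decompression step is needed here, because `read.csv()` opens compressed files transparently. The following minimal sketch illustrates this with a tiny toy file written to a temporary location; the file contents and the `toy` name are assumptions made for the illustration, not part of the storm data.

```r
# Write a tiny CSV into a bzip2-compressed temporary file (toy data only).
tmp <- tempfile(fileext = ".csv.bz2")
con <- bzfile(tmp, open = "wt")
writeLines(c("EVTYPE,FATALITIES", "TORNADO,3", "HAIL,0"), con)
close(con)

# read.csv() receives only the file name and handles the decompression itself.
toy <- read.csv(tmp, stringsAsFactors = FALSE)
dim(toy)

# Remove the temporary file again.
unlink(tmp)
```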
The events data from 1950 to November 2011 has 37 variables and 902297 observations.
dim(data)
## [1] 902297 37
Let’s look at the missing data. A few variables of the dataset contain missing values, but none of them is needed for the analysis, so no data imputation is required:
colSums(is.na(data))
##    STATE__   BGN_DATE   BGN_TIME  TIME_ZONE     COUNTY COUNTYNAME
##          0          0          0          0          0          0
##      STATE     EVTYPE  BGN_RANGE    BGN_AZI BGN_LOCATI   END_DATE
##          0          0          0          0          0          0
##   END_TIME COUNTY_END COUNTYENDN  END_RANGE    END_AZI END_LOCATI
##          0          0     902297          0          0          0
##     LENGTH      WIDTH          F        MAG FATALITIES   INJURIES
##          0          0     843563          0          0          0
##    PROPDMG PROPDMGEXP    CROPDMG CROPDMGEXP        WFO STATEOFFIC
##          0          0          0          0          0          0
##  ZONENAMES   LATITUDE  LONGITUDE LATITUDE_E LONGITUDE_    REMARKS
##          0         47          0         40          0          0
##     REFNUM
##          0
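Note that `is.na()` only counts true `NA` values; an empty string in a text column is not reported as missing. The following sketch, on a made-up three-row data frame, shows the two counts side by side. The column names mirror the report's, but the values are assumptions chosen for the illustration.

```r
# Toy data: one numeric column with NAs, one text column with a blank entry.
toy <- data.frame(
  f          = c(NA, 2, NA),
  propdmgexp = c("K", "", "M"),
  stringsAsFactors = FALSE
)

# NA values per column (the check used in the report).
na_counts <- colSums(is.na(toy))

# Empty strings per column, which is.na() does not flag.
blank_counts <- sapply(toy, function(col) sum(col == "", na.rm = TRUE))

na_counts
blank_counts
```

A blank exponent is therefore a regular value as far as this check is concerned, and it has to be given a meaning explicitly when the costs are computed.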
The event type variable contains many near-duplicate values and has to be cleaned so that the aggregations are more accurate. We also take the opportunity to convert all variable names to lowercase.
library(dplyr)
names(data) <- tolower(names(data))
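# Strip separators and lowercase the labels; the `&-/` range in the class also matches ' ( ) * + , - and .
data.tidy <- data %>% mutate(evtype=tolower(gsub("[/.&-// ]","",evtype)))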
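To make the effect of this cleaning step concrete, the sketch below applies the same pattern to a handful of illustrative raw labels. The helper name `clean_evtype` and the sample vector are assumptions introduced for the example.

```r
# Same pattern and steps as in the report, wrapped in an illustrative helper.
clean_evtype <- function(x) tolower(gsub("[/.&-// ]", "", x))

# Illustrative raw labels with different separators and casing.
raw <- c("HURRICANE/TYPHOON", "WILD/FOREST FIRE", "Tstm Wind",
         "TSTM WIND", "THUNDERSTORM WIND")

cleaned <- clean_evtype(raw)
unique(cleaned)
```

Labels that differ only in case or separators collapse into one value, whereas abbreviations and synonyms such as `tstmwind` and `thunderstormwind` remain distinct categories.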
This section answers both questions above, based on the analysis of the prepared tidy data.
The variables needed to analyze the events that affected population health are selected and summarized: the fatalities and injuries are summed by event type.
data.events.health <- data.tidy %>% select(evtype, fatalities, injuries)
data.events.health.frequency <- summarise(group_by(data.events.health, evtype),
                                          injuries=sum(injuries),
                                          fatalities=sum(fatalities),
                                          # both names now refer to the group totals computed above
                                          frequency=fatalities+injuries)
The ten most harmful event types are obtained with a descending sort.
data.events.health.frequency.10 <- data.events.health.frequency %>%
arrange(desc(frequency)) %>%
slice(1:10)
data.events.health.frequency.10
## # A tibble: 10 × 4
## evtype injuries fatalities frequency
## <chr> <dbl> <dbl> <dbl>
## 1 tornado 91346 5633 96979
## 2 excessiveheat 6525 1903 8428
## 3 tstmwind 6957 504 7461
## 4 flood 6789 470 7259
## 5 lightning 5230 817 6047
## 6 heat 2100 937 3037
## 7 flashflood 1777 978 2755
## 8 icestorm 1975 89 2064
## 9 thunderstormwind 1488 133 1621
## 10 winterstorm 1321 206 1527
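The group, sum and sort pattern used above can be checked by hand on a small example. The sketch below uses base R (`aggregate()` and `order()`) instead of dplyr, on made-up numbers; the `toy`, `totals`, `harm` and `top` names are assumptions for the illustration.

```r
# Toy records: several rows per event type.
toy <- data.frame(
  evtype     = c("tornado", "flood", "tornado", "heat", "flood"),
  fatalities = c(2, 1, 0, 4, 0),
  injuries   = c(10, 3, 5, 1, 0)
)

# Sum both columns within each event type.
totals <- aggregate(cbind(fatalities, injuries) ~ evtype, data = toy, FUN = sum)

# Combined casualty count per event type.
totals$harm <- totals$fatalities + totals$injuries

# Sort in descending order and keep the top two rows.
top <- head(totals[order(-totals$harm), ], 2)
top
```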
As a result, tornadoes are the event type that had the greatest impact on population health over the last six decades.
The following plot presents the table above graphically.
library(ggplot2)
library(scales)
data.events.health.frequency.10$evtype <- reorder(data.events.health.frequency.10$evtype, -data.events.health.frequency.10$frequency)
ggplot(data.events.health.frequency.10, aes(x=evtype,y=frequency)) +
geom_bar(stat="identity",fill="steelblue") +
geom_text(aes(label = frequency), vjust=1.6, color="white", size=3.5) +
scale_y_log10(labels = trans_format("log10", math_format(10^.x))) +
theme(axis.text.x = element_text(angle=45,hjust=1,vjust=1.0)) +
labs(title="Ten most harmful events from 1950 to 2011", x="event")
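The bars appear in descending order because of the `reorder()` call before the plot: it turns the event type into a factor whose levels are sorted by the negated total. A minimal sketch with three of the totals from the table above (the `evtype`, `frequency` and `ordered_evtype` names are local to this example):

```r
# Three event types and their totals, taken from the table above.
evtype    <- c("flood", "tornado", "heat")
frequency <- c(7259, 96979, 3037)

# Negating the score makes the largest total come first in the factor levels.
ordered_evtype <- reorder(evtype, -frequency)
levels(ordered_evtype)
```

ggplot2 draws a discrete x axis in level order, so the plot follows this ordering rather than the alphabetical one.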
The variables needed to analyze the events with the greatest economic impact are selected and summarized: the damage costs are calculated and summed.
Some preliminary work is required because the original dataset does not contain the final costs. First, let’s build a table of multipliers based on the scale factors detected in both exponent variables.
library(data.table)
coefficientsDT <- data.table(x=c("","H","K","M","B"), y=c(1,100,1000,1e+06, 1e+09))
setkey(coefficientsDT)
These coefficients are applied to the provisional cost values to calculate the final costs.
data.events.damages <- data.tidy %>%
select(evtype, propdmg, propdmgexp, cropdmg, cropdmgexp) %>%
mutate(
propdmg.cost=coefficientsDT[as.character(toupper(propdmgexp)),y]*propdmg,
cropdmg.cost=coefficientsDT[as.character(toupper(cropdmgexp)),y]*cropdmg)
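The keyed lookup above returns `NA` for any exponent code that is not in the coefficient table, so the corresponding cost is `NA` as well. The sketch below reproduces that behaviour with a base R named vector instead of data.table. The sample codes, including the unmapped code `"5"`, and the names `multipliers`, `exp_code`, `amount`, `mult` and `cost` are assumptions made for the illustration.

```r
# Multipliers for the mapped exponent codes (a blank code is handled separately).
multipliers <- c(H = 1e2, K = 1e3, M = 1e6, B = 1e9)

# Toy records: amounts with their exponent codes; "5" is a hypothetical unmapped code.
exp_code <- c("K", "m", "", "B", "5")
amount   <- c(25, 1.5, 10, 2, 3)

code <- toupper(exp_code)
mult <- unname(multipliers[code])  # unmapped and blank codes give NA here
mult[code == ""] <- 1              # a blank exponent means "no scaling"

cost <- amount * mult
cost
```

Whether such unmapped codes occur in the storm data can be checked with `table(data.tidy$propdmgexp)` and `table(data.tidy$cropdmgexp)` before trusting the totals.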
The costs can now be summarized by event type.
data.events.damages.costs <- summarise(group_by(data.events.damages, evtype),
propdmg.costs=sum(propdmg.cost),
cropdmg.costs=sum(cropdmg.cost),
costs=sum(propdmg.cost+cropdmg.cost))
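One property of this summary is worth keeping in mind: `sum()` is called without `na.rm = TRUE`, so a single `NA` cost makes the total of its whole event type `NA`, and such a total sorts after every real number. The sketch below shows the mechanism on made-up values, using base R `tapply()` rather than dplyr; it makes no claim about which event types in the storm data are affected.

```r
# Toy costs: one flood record has an NA cost.
toy <- data.frame(
  evtype = c("flood", "flood", "hurricane", "hurricane"),
  cost   = c(5e9, NA, 1e9, 2e9)
)

# Totals with NA propagation (as in the report) and with NAs dropped.
strict  <- tapply(toy$cost, toy$evtype, sum)
lenient <- tapply(toy$cost, toy$evtype, sum, na.rm = TRUE)

# Descending ranking; order() places NA totals last.
rank_strict  <- names(strict)[order(-strict)]
rank_lenient <- names(lenient)[order(-lenient)]

rank_strict
rank_lenient
```

In the toy data the flood total is the largest once `NA` values are dropped, yet it ranks last under the strict sum.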
The ten event types with the greatest economic impact are obtained with a descending sort.
data.events.damages.costs.10 <- data.events.damages.costs %>%
arrange(desc(costs)) %>%
slice(1:10)
data.events.damages.costs.10
## # A tibble: 10 × 4
## evtype propdmg.costs cropdmg.costs costs
## <chr> <dbl> <dbl> <dbl>
## 1 hurricanetyphoon 69305840000 2607872800 71913712800
## 2 stormsurge 43323536000 5000 43323541000
## 3 hurricane 11868319010 2741910000 14610229010
## 4 riverflood 5118945500 5029459000 10148404500
## 5 tropicalstorm 7703890550 678346000 8382236550
## 6 wildfire 4765114000 295472800 5060586800
## 7 stormsurgetide 4641188000 850000 4642038000
## 8 hurricaneopal 3172846000 19000000 3191846000
## 9 wildforestfire 3001829500 106796830 3108626330
## 10 heavyrainsevereweather 2500000000 0 2500000000
As a result, hurricanes and typhoons are the event type that had the greatest economic impact over the last six decades.
The following plot presents the table above graphically.
data.events.damages.costs.10$evtype <- reorder(
data.events.damages.costs.10$evtype, -data.events.damages.costs.10$costs)
ggplot(data.events.damages.costs.10, aes(x=evtype,y=costs)) +
geom_bar(stat="identity",fill="steelblue") +
geom_text(aes(label = costs), vjust=1.6, color="white", size=2.5) +
scale_y_log10(labels = trans_format("log10", math_format(10^.x))) +
theme(axis.text.x = element_text(angle=45,hjust=1,vjust=1.0)) +
labs(title="Ten events with greatest economic consequences from 1950 to 2011", x="event")
In summary, the variables of the original dataset and their observations reveal that tornadoes had the greatest health consequences, and hurricanes and typhoons the greatest economic consequences, over the last six decades.