This report details the analysis made to understand the public health and economic impact of major weather events in the United States. Impact of events is considered as fatalities and injuries for public health, and property and crop damage for economics.
The data analyzed is from the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database, which contains events from 1950 to November 2011. For this study, only data from 1991 to November 2011 was used.
This analysis could be useful to prioritize resources needed for different types of events, and to make disaster contingency plans to be prepared in case of these events happens.
Setting knitr options and libraries needed.
require(lubridate)
require(plyr)
require(stringr)
require(ggplot2)
require(gridExtra)
require(knitr)
The raw data is downloaded from Coursera provided link of NOAA’s storm data.
filename <- "repdata-data-StormData.csv.bz2"
if (!file.exists(filename)) {
url <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
download.file(url=url,destfile=filename)
}
rawdata <- read.csv(filename, strip.white=TRUE)
head(rawdata)
## STATE__ BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME STATE
## 1 1 4/18/1950 0:00:00 0130 CST 97 MOBILE AL
## 2 1 4/18/1950 0:00:00 0145 CST 3 BALDWIN AL
## 3 1 2/20/1951 0:00:00 1600 CST 57 FAYETTE AL
## 4 1 6/8/1951 0:00:00 0900 CST 89 MADISON AL
## 5 1 11/15/1951 0:00:00 1500 CST 43 CULLMAN AL
## 6 1 11/15/1951 0:00:00 2000 CST 77 LAUDERDALE AL
## EVTYPE BGN_RANGE BGN_AZI BGN_LOCATI END_DATE END_TIME COUNTY_END
## 1 TORNADO 0 0
## 2 TORNADO 0 0
## 3 TORNADO 0 0
## 4 TORNADO 0 0
## 5 TORNADO 0 0
## 6 TORNADO 0 0
## COUNTYENDN END_RANGE END_AZI END_LOCATI LENGTH WIDTH F MAG FATALITIES
## 1 NA 0 14.0 100 3 0 0
## 2 NA 0 2.0 150 2 0 0
## 3 NA 0 0.1 123 2 0 0
## 4 NA 0 0.0 100 2 0 0
## 5 NA 0 0.0 150 2 0 0
## 6 NA 0 1.5 177 2 0 0
## INJURIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP WFO STATEOFFIC ZONENAMES
## 1 15 25.0 K 0
## 2 0 2.5 K 0
## 3 2 25.0 K 0
## 4 2 2.5 K 0
## 5 2 2.5 K 0
## 6 6 2.5 K 0
## LATITUDE LONGITUDE LATITUDE_E LONGITUDE_ REMARKS REFNUM
## 1 3040 8812 3051 8806 1
## 2 3042 8755 0 0 2
## 3 3340 8742 0 0 3
## 4 3458 8626 0 0 4
## 5 3412 8642 0 0 5
## 6 3450 8748 0 0 6
For the intended analysis of impact, the variables pertinent are the following:
First, I subset the variables which are used as-is, and give them more descriptive names.
data <- rawdata[,c("EVTYPE","FATALITIES","INJURIES")]
names(data) <- c("event.type","fatalities","injuries")
Next, get year of the event.
data$year <- year(mdy_hms(rawdata$BGN_DATE))
For the damage estimations it’s necessary to normalize the values that will be used for analysis, so I will adjust values of PROPDMG and CROPDMG variables according to PROPDMGEXP and CROPDMEXP code units. From Storm Data Documentation (page 12): “Alphabetical characters used to signify magnitude include K for thousands, M for millions, and B for billions”. Any other character different from the established I’ll assume that there is no multiplication factor, so value in the corresponding damage column would not be affected (multiply by 1).
For this task , I define a function to normalize damage amounts.
## Function to normalize damage amounts
norm_factor <- function(code) {
factor <- 1
code <- toupper(code)
if (code == "B")
factor <- 1000000000
else if (code == "M")
factor <- 1000000
else if (code == "K")
factor <- 1000
return(factor)
}
# Normalize property and crop damage and recalculate in millions base, also
# calculate total damage.
data$property.damage <- rawdata$PROPDMG * sapply(rawdata$PROPDMGEXP,
FUN=norm_factor)
data$crop.damage <- rawdata$CROPDMG * sapply(rawdata$CROPDMGEXP,
FUN=norm_factor)
data$total.damage <- data$property.damage + data$crop.damage
head(data)
## event.type fatalities injuries year property.damage crop.damage
## 1 TORNADO 0 15 1950 25000 0
## 2 TORNADO 0 0 1950 2500 0
## 3 TORNADO 0 2 1951 25000 0
## 4 TORNADO 0 2 1951 2500 0
## 5 TORNADO 0 2 1951 2500 0
## 6 TORNADO 0 6 1951 2500 0
## total.damage
## 1 25000
## 2 2500
## 3 25000
## 4 2500
## 5 2500
## 6 2500
The full range of data compiled by NOAA is 62 years, but considering that early years there was fewer events recorded, most likely due to lack of good records, I use the last 20 years that represent 82.1% of the observations.
## Percentage of last 20 years records
sum(data$year>1990) / length(data$year)
## [1] 0.821009
# Subsetting for the last 20 years (year > 1990)
data <- data[data$year>1990,]
The event types defined by NOAA are 48 types, and data analyzed has 985 types, mainly due to typographics errors or names slightly different. As basic normalization steps, I lowercase types and remove extra blanks, and as a result, there are only 883 types.
length(unique(data$event.type))
## [1] 985
head(unique(data$event.type))
## [1] TSTM WIND HAIL TORNADO
## [4] FREEZING RAIN SNOW ICE STORM/FLASH FLOOD
## 985 Levels: HIGH SURF ADVISORY COASTAL FLOOD ... WND
data$event.type <- str_trim(gsub("[[:blank:]]{2,}"," ",data$event.type))
data$event.type <- tolower(data$event.type)
length(unique(data$event.type))
## [1] 883
head(unique(data$event.type))
## [1] "tstm wind" "hail" "tornado"
## [4] "freezing rain" "snow" "ice storm/flash flood"
Important. The event types should be cleaned more accurately, to combine those events that are the same but are classified with an older type or combined types in one record, but this goal requires a very deep understanding of data and I consider it is outside of the goal of this project.
Now, I make the tidy data data frame summarizing total amount fatalities, injuries and property, crop and total damage per event type. After that, we eliminate those event type that has no contribution at all (fatalities, injuries, property damage and crop damage are all equal to zero for a specific event type).
tidydata <- ddply(data, .(event.type), summarize, fatalities=sum(fatalities),
injuries=sum(injuries), property.damage=sum(property.damage),
crop.damage=sum(crop.damage), total.damage=sum(total.damage))
filter <- (tidydata$fatalities + tidydata$injuries +
tidydata$property.damage + tidydata$crop.damage) > 0
tidydata <- tidydata[filter,]
Finally, we have the tidy data required for the analysis.
dim(tidydata)
## [1] 440 6
str(tidydata)
## 'data.frame': 440 obs. of 6 variables:
## $ event.type : chr "?" "agricultural freeze" "apache county" "astronomical high tide" ...
## $ fatalities : num 0 0 0 0 0 1 224 0 1 101 ...
## $ injuries : num 0 0 0 0 0 0 170 0 24 805 ...
## $ property.damage: num 5000 0 5000 9425000 320000 ...
## $ crop.damage : num 0 28820000 0 0 0 ...
## $ total.damage : num 5000 28820000 5000 9425000 320000 ...
head(tidydata)
## event.type fatalities injuries property.damage crop.damage
## 1 ? 0 0 5000 0
## 6 agricultural freeze 0 0 0 28820000
## 7 apache county 0 0 5000 0
## 8 astronomical high tide 0 0 9425000 0
## 9 astronomical low tide 0 0 320000 0
## 10 avalance 1 0 0 0
## total.damage
## 1 5000
## 6 28820000
## 7 5000
## 8 9425000
## 9 320000
## 10 0
Fatalities and injuries cannot be combined for analysis, because they have different implications;for example, injuries may require hospitalization and medical treatment, whereas fatalities don’t. So, I filter the top 5 events by fatalities and the 5 top events by injuries.
## Get the top five (fatalities)
top.fatalities <- arrange(tidydata,desc(fatalities))[,c(1,2)]
top.fatalities$percentage <- top.fatalities$fatalities /
sum(top.fatalities$fatalities) * 100
top.fatalities <- head(top.fatalities,5)
kable(top5.fatalities, format="markdown", col.names=c("Event Type","Fatalities","%"))
Table PH-1. Top 5 events causing fatalities
| Event Type | Fatalities | % |
|---|---|---|
| excessive heat | 1903 | 17.312591 |
| tornado | 1699 | 15.456696 |
| flash flood | 978 | 8.897380 |
| heat | 937 | 8.524381 |
| lightning | 816 | 7.423581 |
## Get the top five (injuries)
top.injuries <- arrange(tidydata,desc(injuries))[,c(1,3)]
top.injuries$percentage <- top.injuries$injuries /
sum(top.injuries$injuries) * 100
top.injuries <- head(top.injuries,5)
kable(top5.injuries, format="markdown", col.names=c("Event Type","Injuries","%"))
Table PH-2. Top 5 events causing injuries
| Event Type | Injuries | % |
|---|---|---|
| tornado | 25497 | 35.474581 |
| flood | 6789 | 9.445697 |
| excessive heat | 6525 | 9.078387 |
| lightning | 5230 | 7.276623 |
| tstm wind | 4441 | 6.178869 |
## Plot these events
p1 <- ggplot(data=top.fatalities, aes(x=reorder(event.type,percentage),
y=percentage, fill=fatalities)) +
geom_bar(stat="identity") +
coord_flip() + theme(legend.position="none") +
xlab("") + ylab("% of fatalities caused by event")
p2 <- ggplot(data=top.injuries, aes(x=reorder(event.type,percentage),
y=percentage, fill=injuries)) +
geom_bar(stat="identity") +
coord_flip() + theme(legend.position="none") +
xlab("") + ylab("% of injuries caused by event")
grid.arrange(p1, p2, main="Fig. 1 - Public Health Top Weather Events (1991-2011)")
From Table PH-1 and Figure 1, Excessive Heat, Tornado, Flash Flood, Heat and Lighting causes 56.19% of fatalities in US. Similarly, from table PH-2 and Figure 1, Tornado, Flood, Excessive Heat, Lightning and Tstm Wind causes 57.04% of injuries in US.
The recomendation is work on prevention for Tornado, Excessive Heat and Lightning for Public Health disaster politics, because those three are common on each top 5.
Economic impact may be totalized from both type of damages, to identify the most dangerous events, so I filter the top 5 events by total damage caused.
## Get the top five
top.damage <- arrange(tidydata,desc(total.damage))[,c(1,4,5,6)]
top.damage$percentage <- top.damage$total.damage /
sum(top.damage$total.damage) * 100
top.damage <- head(top.damage,5)
kable(top5.fatalities, format="markdown", col.names=c("Event Type","Fatalities","%"))
Table EC-1. Top 5 events by economic damage caused
| Event Type | Property Damage | Crop Damage | Total Damage | % Total |
|---|---|---|---|---|
| flood | 144657709807 | 5661968450 | 150319678257 | 33.524792 |
| hurricane/typhoon | 69305840000 | 2607872800 | 71913712800 | 16.038434 |
| storm surge | 43323536000 | 5000 | 43323541000 | 9.662159 |
| tornado | 28897947539 | 414953270 | 29312900809 | 6.537460 |
| hail | 15732267048 | 3025954473 | 18758221521 | 4.183521 |
## Plot these events
p3 <- ggplot(data=top.damage, aes(x=reorder(event.type,percentage),
y=percentage, fill=total.damage)) +
geom_bar(stat="identity") +
coord_flip() + theme(legend.position="none") +
xlab("") + ylab("% of total damage caused by event")
grid.arrange(p3, main="Fig. 2 - Economic Damage Top Weather Events (1991-2011)")
From Table EC-1 and Figure 2, Flood, Hurricane/Typhon causes 49.56% of total damage in US. The recomendation is work on prevention of those types of event to avoid great economic damage.