Storms and other severe weather events can cause both public health and economic problems for communities and municipalities. Many severe events can result in fatalities, injuries, and property damage, and preventing such outcomes to the extent possible is a key concern. Here we present an analysis based on the NOAA Storm Database. The events in the database start in the year 1950 and end in November 2011. In the earlier years of the database there are generally fewer events recorded, most likely due to a lack of good records. More recent years should be considered more complete. The analysis adresses the following two questions:
The data come in the form of a comma-separated-value file compressed via the bzip2 algorithm to reduce its size. The file can be downloaded from here.
There is also some documentation of the database available. There it is explained how some of the variables are constructed/defined.
Using R, the data can be downloaded using the link given above and loaded to R using the following code chunk. The data are store in a variable called data.
url <- "http://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
download.file(url, destfile="repdata_data_StormData.csv.bz2")
dateDownloaded <- date()
data <- read.csv(bzfile("repdata_data_StormData.csv.bz2"))
By executing the following code in R we get a summary of the data
str(data)
## 'data.frame': 902297 obs. of 37 variables:
## $ STATE__ : num 1 1 1 1 1 1 1 1 1 1 ...
## $ BGN_DATE : Factor w/ 16335 levels "1/1/1966 0:00:00",..: 6523 6523 4242 11116 2224 2224 2260 383 3980 3980 ...
## $ BGN_TIME : Factor w/ 3608 levels "00:00:00 AM",..: 272 287 2705 1683 2584 3186 242 1683 3186 3186 ...
## $ TIME_ZONE : Factor w/ 22 levels "ADT","AKS","AST",..: 7 7 7 7 7 7 7 7 7 7 ...
## $ COUNTY : num 97 3 57 89 43 77 9 123 125 57 ...
## $ COUNTYNAME: Factor w/ 29601 levels "","5NM E OF MACKINAC BRIDGE TO PRESQUE ISLE LT MI",..: 13513 1873 4598 10592 4372 10094 1973 23873 24418 4598 ...
## $ STATE : Factor w/ 72 levels "AK","AL","AM",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ EVTYPE : Factor w/ 985 levels " HIGH SURF ADVISORY",..: 834 834 834 834 834 834 834 834 834 834 ...
## $ BGN_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ BGN_AZI : Factor w/ 35 levels ""," N"," NW",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ BGN_LOCATI: Factor w/ 54429 levels "","- 1 N Albion",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ END_DATE : Factor w/ 6663 levels "","1/1/1993 0:00:00",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ END_TIME : Factor w/ 3647 levels ""," 0900CST",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ COUNTY_END: num 0 0 0 0 0 0 0 0 0 0 ...
## $ COUNTYENDN: logi NA NA NA NA NA NA ...
## $ END_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ END_AZI : Factor w/ 24 levels "","E","ENE","ESE",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ END_LOCATI: Factor w/ 34506 levels "","- .5 NNW",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ LENGTH : num 14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
## $ WIDTH : num 100 150 123 100 150 177 33 33 100 100 ...
## $ F : int 3 2 2 2 2 2 2 1 3 3 ...
## $ MAG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ FATALITIES: num 0 0 0 0 0 0 0 0 1 0 ...
## $ INJURIES : num 15 0 2 2 2 6 1 0 14 0 ...
## $ PROPDMG : num 25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
## $ PROPDMGEXP: Factor w/ 19 levels "","-","?","+",..: 17 17 17 17 17 17 17 17 17 17 ...
## $ CROPDMG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CROPDMGEXP: Factor w/ 9 levels "","?","0","2",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ WFO : Factor w/ 542 levels ""," CI","$AC",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ STATEOFFIC: Factor w/ 250 levels "","ALABAMA, Central",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ ZONENAMES : Factor w/ 25112 levels ""," "| __truncated__,..: 1 1 1 1 1 1 1 1 1 1 ...
## $ LATITUDE : num 3040 3042 3340 3458 3412 ...
## $ LONGITUDE : num 8812 8755 8742 8626 8642 ...
## $ LATITUDE_E: num 3051 0 0 0 0 ...
## $ LONGITUDE_: num 8806 0 0 0 0 ...
## $ REMARKS : Factor w/ 436781 levels "","-2 at Deer Park\n",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ REFNUM : num 1 2 3 4 5 6 7 8 9 10 ...
As we can see from the summary above, the data frame contains 902297 records with 37 variables. For our analysis we need only the following 7 variables:
The EVTYPE variable contains 985 unique values. According to the documentation of the database, there should be only 48 Storm event types. In order to be consistent, we are going to use the data records of the 48 designated event types given in the documentation of the database.
Also, to be consistent we are going to make all the string values of the dataset to upper case to avoid any duplicates.
The code below executes the steps described above to create a new dataset.
tidy <- data[, c("EVTYPE", "PROPDMG", "PROPDMGEXP", "FATALITIES", "INJURIES", "CROPDMG", "CROPDMGEXP")]
approvedTypes <- c("Astronomical Low Tide", "Avalanche", "Blizzard", "Coastal Flood", "Cold/Wind Chill", "Debris Flow", "Dense Fog", "Dense Smoke", "Drought", "Dust Devil", "Dust Storm", "Excessive Heat", "Extreme Cold/Wind Chill", "Flash Flood", "Flood", "Freezing Fog", "Frost/Freeze", "Funnel Cloud", "Hail", "Heat", "Heavy Rain", "Heavy Snow", "High Surf", "High Wind", "Hurricane/Typhoon", "Ice Storm", "Lakeshore Flood", "Lake-Effect Snow", "Lightning", "Marine Hail", "Marine High Wind", "Marine Strong Wind", "Marine Thunderstorm Wind", "Rip Current", "Seiche", "Sleet", "Storm Tide", "Strong Wind", "Thunderstorm Wind", "Tornado", "Tropical Depression", "Tropical Storm", "Tsunami", "Volcanic Ash", "Waterspout", "Wildfire", "Winter Storm", "Winter Weather")
approvedTypes <- toupper(approvedTypes)
tidy$EVTYPE <- toupper(tidy$EVTYPE)
tidy <- tidy[tidy$EVTYPE %in% approvedTypes,]
The data now contains 635291 records with 7 variables.
In the sequence, we are going to deal with the PROPDMGEXP and CROPDMGEXP variables. Below we can see the summary of the PROPDMGEXP and CROPDMGEXP variables respectively.
summary(tidy$PROPDMGEXP)
## - ? + 0 1 2 3 4 5
## 279394 1 6 2 51 8 8 1 1 18
## 6 7 8 B h H K m M
## 1 4 1 26 0 1 345724 4 10040
summary(tidy$PROPDMGEXP)
## - ? + 0 1 2 3 4 5
## 279394 1 6 2 51 8 8 1 1 18
## 6 7 8 B h H K m M
## 1 4 1 26 0 1 345724 4 10040
According to the documentation of the database, the accepted values are H, K, M and B standing for hundrends, thousands, millions and billions, respectively. The empty values are also acceptable. The lower case h, k, m and b will be transformed to upper case and will be accepted. All the records with the rest of the values are going to be removed from the dataset and the calculation. The impact is negligible.
In the sequence, the values from the PROPDMGEXP and CROPDMGEXP are going to be multiplied with the values of PROPDMG and CROPDMG to get the complete amount of dollars in the new columns PROPDMGTOTAL and CROPDMGTOTAL by using the respective values for the exponentials H, K, M and B (100, 1000, 10^6 and 10^9). The final tidy dataset is going to include the following 5 variables:
In the code below executes the steps mentioned above.
tidy$PROPDMGEXP <- toupper(tidy$PROPDMGEXP)
tidy$CROPDMGEXP <- toupper(tidy$CROPDMGEXP)
exponentials <- c("H","K","M","B","")
tidy <- tidy[tidy$PROPDMGEXP %in% exponentials,]
tidy <- tidy[tidy$CROPDMGEXP %in% exponentials,]
tidy$PROPDMGEXP[tidy$PROPDMGEXP==""] <- 0
tidy$PROPDMGEXP[tidy$PROPDMGEXP=="H"] <- 100
tidy$PROPDMGEXP[tidy$PROPDMGEXP=="K"] <- 1000
tidy$PROPDMGEXP[tidy$PROPDMGEXP=="M"] <- 1000000
tidy$PROPDMGEXP[tidy$PROPDMGEXP=="B"] <- 1000000000
tidy$PROPDMGTOTAL <- as.numeric(tidy$PROPDMGEXP)*tidy$PROPDMG
tidy$CROPDMGEXP[tidy$CROPDMGEXP==""] <- 0
tidy$CROPDMGEXP[tidy$CROPDMGEXP=="K"] <- 1000
tidy$CROPDMGEXP[tidy$CROPDMGEXP=="M"] <- 1000000
tidy$CROPDMGEXP[tidy$CROPDMGEXP=="B"] <- 1000000000
tidy$CROPDMGTOTAL <- as.numeric(tidy$CROPDMGEXP)*tidy$CROPDMG
tidy <- tidy[, c("EVTYPE", "PROPDMGTOTAL", "FATALITIES",
"INJURIES", "CROPDMGTOTAL")]
At this point the data are tidy and clean ready for our analysis. Below is the summary of the data.
str(tidy)
## 'data.frame': 635181 obs. of 5 variables:
## $ EVTYPE : chr "TORNADO" "TORNADO" "TORNADO" "TORNADO" ...
## $ PROPDMGTOTAL: num 25000 2500 25000 2500 2500 2500 2500 2500 25000 25000 ...
## $ FATALITIES : num 0 0 0 0 0 0 0 0 1 0 ...
## $ INJURIES : num 15 0 2 2 2 6 1 0 14 0 ...
## $ CROPDMGTOTAL: num 0 0 0 0 0 0 0 0 0 0 ...
In order to perform our analysis and provide the results, the following packages are used:
library(ggplot2)
The top 10 most harmful types of events that cause fatalities to the population can be calculated by using the following R code:
totalFatalitiesPerEvent <- aggregate(tidy$FATALITIES,
by=list(tidy$EVTYPE), sum)
colnames(totalFatalitiesPerEvent) <- c("EVTYPE","FATALITIES")
totalFatalitiesPerEvent <- totalFatalitiesPerEvent[
order(totalFatalitiesPerEvent$FATALITIES, decreasing=TRUE),]
and are presented below
print(totalFatalitiesPerEvent[1:10,1:2], row.names=FALSE)
## EVTYPE FATALITIES
## TORNADO 5630
## EXCESSIVE HEAT 1903
## FLASH FLOOD 978
## HEAT 937
## LIGHTNING 816
## FLOOD 470
## RIP CURRENT 368
## HIGH WIND 246
## AVALANCHE 224
## WINTER STORM 206
The top 10 most harmful types of events that cause injuries to the population can be calculated by using the following R code:
totalInjuriesPerEvent <- aggregate(tidy$INJURIES,
by=list(tidy$EVTYPE), sum)
colnames(totalInjuriesPerEvent) <- c("EVTYPE","INJURIES")
totalInjuriesPerEvent <- totalInjuriesPerEvent[
order(totalInjuriesPerEvent$INJURIES, decreasing=TRUE),]
and are presented below
print(totalInjuriesPerEvent[1:10,1:2], row.names=FALSE)
## EVTYPE INJURIES
## TORNADO 91321
## FLOOD 6789
## EXCESSIVE HEAT 6525
## LIGHTNING 5230
## HEAT 2100
## ICE STORM 1975
## FLASH FLOOD 1777
## THUNDERSTORM WIND 1488
## HAIL 1360
## WINTER STORM 1321
Below is a graph that further summarizes the top 10 most harmful events in term of fatalities
plot1 <- ggplot(totalFatalitiesPerEvent[1:10,],
aes(x=reorder(EVTYPE, FATALITIES) ,y=FATALITIES)) +
geom_bar(stat="identity",fill="dark red") + coord_flip() +
labs(x = "Event type",
y="Fatalities",
title="Top 10 events by Fatalities")
plot1
Here is the graph that summarizes the top 10 most harmful events in term of injuries
plot2 <- ggplot(totalInjuriesPerEvent[1:10,],
aes(x=reorder(EVTYPE, INJURIES) ,y=INJURIES)) +
geom_bar(stat="identity",fill="red") + coord_flip() +
labs(x = "Event type",
y="Injuries",
title="Top 10 events by Injuries")
plot2
It is obvious that the Storm Event that is most harmful for the population is the Tornado Storm Event.
The top 10 most harmful types of events for the economy can be calculated by using the following R code:
EconomyDamagePROP <- aggregate(tidy$PROPDMGTOTAL,
by=list(tidy$EVTYPE), sum)
colnames(EconomyDamagePROP) <- c("EVTYPE","PROPDAMAGES")
EconomyDamageCROP <- aggregate(tidy$CROPDMGTOTAL,
by=list(tidy$EVTYPE), sum)
colnames(EconomyDamageCROP) <- c("EVTYPE","CROPDAMAGES")
totalEconomyDamages <- merge(EconomyDamagePROP,EconomyDamageCROP,
by.x ="EVTYPE", by.y ="EVTYPE")
totalEconomyDamages$DAMAGES <- totalEconomyDamages$PROPDAMAGES +
totalEconomyDamages$CROPDAMAGES
totalEconomyDamages <- totalEconomyDamages[
order(totalEconomyDamages$DAMAGES, decreasing=TRUE),]
and are presented below
print(totalEconomyDamages[1:10, c("EVTYPE","DAMAGES")],
row.names=FALSE)
## EVTYPE DAMAGES
## FLOOD 150319678250
## HURRICANE/TYPHOON 71913712800
## TORNADO 57301935590
## HAIL 18733216670
## FLASH FLOOD 17561538610
## DROUGHT 15018672000
## ICE STORM 8967037810
## TROPICAL STORM 8382236550
## WINTER STORM 6715441250
## HIGH WIND 5908617560
Below is a graph that further summarizes the top 10 most harmful events in term of economy
plot3 <- ggplot(totalEconomyDamages[1:10,],
aes(x=reorder(EVTYPE, DAMAGES) ,y=DAMAGES/10^9)) +
geom_bar(stat="identity",fill="dark orange") + coord_flip() +
labs(x = "Event type",
y="Billions of dollars",
title="Top 10 events by combined economical damages")
plot3
It is obvious that the Storm Event that is most harmful to the economy in total is the Flood.