Exploring the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database and figuring out the economic / social impact of various events.
This analysis is done on the National Weather Service Storm dataset which has been collected starting in the year 1950 till November 2011. This database tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage. The following specific questions have been explored in this ananlysis.
Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?
Across the United States, which types of events have the greatest economic consequences?
For this analysis, the complete dataset has been read in a dataframe called “data”. Then the “EVTYPE variable has been cleaned based on the given event types. After that, the events which caused maximum fatalities / injuries have been measured and plotted.
Also, the economic effect of these events have been indentified and plotted. This is done based on the “PRODDMG” and “CROPDMG” variables.
The storm database can be downloaded from the following location: Storm Data [47Mb]
The file is in the form of a comma-separated-value file compressed via the bzip2 algorithm to reduce its size. For the analysis, it has already been downloaded and save locally. Read the data in a dataframe and have a look at the content.
setwd("/Users/promisinganuj/Data/Technical/R Language")
data=read.csv(bzfile("repdata-data-StormData.csv.bz2","rt"),header=TRUE)
# load(".RData")
dim(data)
## [1] 902297 37
names(data)
## [1] "STATE__" "BGN_DATE" "BGN_TIME" "TIME_ZONE" "COUNTY"
## [6] "COUNTYNAME" "STATE" "EVTYPE" "BGN_RANGE" "BGN_AZI"
## [11] "BGN_LOCATI" "END_DATE" "END_TIME" "COUNTY_END" "COUNTYENDN"
## [16] "END_RANGE" "END_AZI" "END_LOCATI" "LENGTH" "WIDTH"
## [21] "F" "MAG" "FATALITIES" "INJURIES" "PROPDMG"
## [26] "PROPDMGEXP" "CROPDMG" "CROPDMGEXP" "WFO" "STATEOFFIC"
## [31] "ZONENAMES" "LATITUDE" "LONGITUDE" "LATITUDE_E" "LONGITUDE_"
## [36] "REMARKS" "REFNUM"
There are 37 columns in the dataset as shown above. For this analysis, the following fields would be required:
EVTYPE: Environment Type. 48 distinct types.
FALATITIES: Total fatalities in number because of the corresponding event
INJURIES: Total human injuries in number because of the corresponding event
PROPDMG: Property Damage. Rounded to 3 significant digits
PROPDMGEXP: Alphabetical characters used to signify magnitude include “K” for thousands, “M” for millions, and “B” for billions.
CROPDMG: Crop Damage. Rounded to 3 significant digits.
CROPDMGEXP: Alphabetical characters used to signify magnitude include “K” for thousands, “M” for millions, and “B” for billions.
For more details about the variable, please refer to the copybook at: Storm Data Documentation
Now check how many unique event types are there in the dataset:
dim(as.data.frame(table(data$EVTYPE)))
## [1] 985 2
By sneaking at the event type values, it is evident that there are redundencies in the event names such as “FLOOD”, “Flood”, “Floods” etc. Lets try to cleanup the variable names.
# Changing values to uppercase
data <- transform(data, EVTYPE = toupper(EVTYPE))
# AVALANCHE
data[data$EVTYPE == "AVALANCE", ]$EVTYPE <- "AVALANCHE"
# COASTAL FLOOD
cFlood <- c("COASTAL FLOODING", "COASTAL FLOODING/EROSION", "COASTAL STORM", "COASTAL SURGE", "COASTAL/TIDAL FLOOD", "COASTALSTORM", "COASTALSTORM", "COASTALFLOOD")
data[data$EVTYPE %in% cFlood, ]$EVTYPE <- "COASTAL FLOOD"
# COLD/WIND CHILL
cCold <- c("COLD","COLD WIND CHILL TEMPERATURES","COLD WEATHER","COLD TEMPERATURES","COLD TEMPERATURE","COLD AND FROST", "COLD AIR FUNNEL")
data[data$EVTYPE %in% cCold, ]$EVTYPE <- "COLD/WIND CHILL"
# DROUGHT
data[data$EVTYPE == "DROUGHT/EXCESSIVE HEAT", ]$EVTYPE <- "DROUGHT"
# EXCESSIVE HEAT
data[data$EVTYPE == "EXTREME HEAT", ]$EVTYPE <- "EXCESSIVE HEAT"
# EXTREME COLD/WIND CHILL
cChill <- c("EXTREME COLD","EXTREME WIND CHILL","EXTREME WINDCHILL","EXTREME WINDCHILL TEMPERATURES","EXTREME/RECORD COLD")
data[data$EVTYPE %in% cChill, ]$EVTYPE <- "EXTREME COLD/WIND CHILL"
# FLASH FLOOD
cFlashFlood <- c("FLASH FLOOD - HEAVY RAIN","FLASH FLOOD FROM ICE JAMS","FLASH FLOOD/FLOOD","FLASH FLOODING","FLASH FLOODING/FLOOD","FLASH FLOODS","FLASH FLOOODING")
data[data$EVTYPE %in% cFlashFlood, ]$EVTYPE <- "FLASH FLOOD"
# THUNDERSTORM WIND
data[grep("^THUN.+ST[O,R,M]{3}.+WIND", data$EVTYPE, value=F),]$EVTYPE = "THUNDERSTORM WIND"
data[grep("TSTM WIND", data$EVTYPE, value=F),]$EVTYPE = "THUNDERSTORM WIND"
Now that the base data has been cleaned, the actual processing can now be done.
For answering this statement, there are two key variables, “FATALITIES” and “INJURIES” which have already been described above. Let’s try to aggreate and plot these values for different event types.
fatalities <- aggregate(FATALITIES ~ EVTYPE, data = data, FUN = sum)
fatalities.order <- fatalities[order(fatalities$FATALITIES, decreasing = TRUE), ]
head(fatalities.order)
## EVTYPE FATALITIES
## 658 TORNADO 5633
## 98 EXCESSIVE HEAT 1999
## 114 FLASH FLOOD 1018
## 212 HEAT 937
## 387 LIGHTNING 816
## 648 THUNDERSTORM WIND 743
Based on above, it is evident that maximum fatalities are caused by “TORNADO” followed distinctly by “EXCESSIVE HEAD” and “FLASH FLOODS”. Now, checking the trend against “injuries”.
injuries <- aggregate(INJURIES ~ EVTYPE, data = data, FUN = sum)
injuries.order <- injuries[order(injuries$INJURIES, decreasing = TRUE), ]
head(injuries.order)
## EVTYPE INJURIES
## 658 TORNADO 91346
## 648 THUNDERSTORM WIND 9478
## 123 FLOOD 6789
## 98 EXCESSIVE HEAT 6680
## 387 LIGHTNING 5230
## 212 HEAT 2100
Again, the maximum injuries are caused by “TORNADO”. Let’s plot the first 10 events causing maximum fatalities and injuries respectively.
par(mfrow = c(2, 1), cex = .75)
barplot(fatalities.order[1:10, 2], col = rainbow(10), ylab = "Fatalities Count", main = "10 natural events cause most fatalities")
legend("topright", legend = fatalities.order[1:10, 1], cex = 0.5, fill = rainbow(10))
barplot(injuries.order[1:10, 2], col = rainbow(10), ylab = "Injuries Count", main = "10 natural events cause most people injuries")
legend("topright", legend = injuries.order[1:10, 1], cex = 0.5, fill = rainbow(10))
The cumulative effect can also be shown as:
most.harmful <- aggregate(INJURIES + FATALITIES ~ EVTYPE, data = data, FUN = sum)
names(most.harmful)[2] <- "COUNT"
most.harmful.order <- most.harmful[order(most.harmful$COUNT, decreasing = TRUE), ]
head(most.harmful.order)
## EVTYPE COUNT
## 658 TORNADO 96979
## 648 THUNDERSTORM WIND 10221
## 98 EXCESSIVE HEAT 8679
## 123 FLOOD 7259
## 387 LIGHTNING 6046
## 212 HEAT 3037
Plotting the cumulative effect as well:
barplot(most.harmful.order[1:10, 2], col = rainbow(10), ylab = "Total Count (Fatalities + Injuries)", main = "10 natural events cause most people injuries and/or fatalities")
legend("topright", legend = most.harmful.order[1:10, 1], cex = 0.5, fill = rainbow(10))
Based on the above observation, it is evident that “TORNADO” is the most harmful event individually and as a whole.
For this analysis, the relevant columns are “EVTYPE”, PRODDMG“,”PROPDMGEXP“,”CROPDMG" and “CROPDMGEXP”. Lets have a look at the “PROPDMG” and “CROPDMG” variables.
unique(data$PROPDMGEXP)
## [1] K M B m + 0 5 6 ? 4 2 3 h 7 H - 1 8
## Levels: - ? + 0 1 2 3 4 5 6 7 8 B h H K m M
as.data.frame(table(data$PROPDMGEXP))
## Var1 Freq
## 1 465934
## 2 - 1
## 3 ? 8
## 4 + 5
## 5 0 216
## 6 1 25
## 7 2 13
## 8 3 4
## 9 4 4
## 10 5 28
## 11 6 4
## 12 7 5
## 13 8 1
## 14 B 40
## 15 h 1
## 16 H 6
## 17 K 424665
## 18 m 7
## 19 M 11330
unique(data$CROPDMGEXP)
## [1] M K m B ? 0 k 2
## Levels: ? 0 2 B k K m M
as.data.frame(table(data$CROPDMGEXP))
## Var1 Freq
## 1 618413
## 2 ? 7
## 3 0 19
## 4 2 1
## 5 B 9
## 6 k 21
## 7 K 281832
## 8 m 1
## 9 M 1994
As mentioned in the “Data Processing” section, there are only three valid values for “%EXP” variables which are “K”, “M” and “B”. As it can be seen from above data, the values are in mixed cases, let’s change them to uppercase and calculate the actual damage values.
data <- transform(data, PROPDMGEXP = toupper(PROPDMGEXP))
data <- transform(data, CROPDMGEXP = toupper(CROPDMGEXP))
data[data$PROPDMGEXP == "K", ]$PROPDMG <- data[data$PROPDMGEXP == "K", ]$PROPDMG * 1e03
data[data$PROPDMGEXP == "M", ]$PROPDMG <- data[data$PROPDMGEXP == "M", ]$PROPDMG * 1e06
data[data$PROPDMGEXP == "B", ]$PROPDMG <- data[data$PROPDMGEXP == "B", ]$PROPDMG * 1e09
data[data$CROPDMGEXP == "K", ]$CROPDMG <- data[data$CROPDMGEXP == "K", ]$CROPDMG * 1e03
data[data$CROPDMGEXP == "M", ]$CROPDMG <- data[data$CROPDMGEXP == "M", ]$CROPDMG * 1e06
data[data$CROPDMGEXP == "B", ]$CROPDMG <- data[data$CROPDMGEXP == "B", ]$CROPDMG * 1e09
As we are interested in overall economic consequences, it makes sense to add the property and crop damage figures to see the overall trend. Let’s make a plot to summarise the 10 most economically harmful events.
total.damage <- aggregate(CROPDMG + PROPDMG ~ EVTYPE, data = data, sum)
names(total.damage)[2] <- "TOTALDMG"
total.damage.order <- total.damage[order(total.damage$TOTALDMG, decreasing = TRUE), ]
head(total.damage.order)
## EVTYPE TOTALDMG
## 123 FLOOD 150319678257
## 341 HURRICANE/TYPHOON 71913712800
## 658 TORNADO 57352114049
## 565 STORM SURGE 43323541000
## 181 HAIL 18758221521
## 114 FLASH FLOOD 18168929327
barplot(total.damage.order[1:10, 2], col = rainbow(10), ylab = "Total Damage", main = "10 most economically harmful natural events")
legend("topright", legend = total.damage.order[1:10, 1], fill = rainbow(10), cex = 0.5)
Based on the above plot, it is evident that “flooding” has most severe economic consequences followed by “hurricanes” and “tornados”.