Analysis of NOAA storm data reported between 1996 and 2011 shows that, of the 430 distinct severe weather event types recorded, tornadoes, excessive heat, floods (including flash floods) and lightning cause the most harm to human life in terms of fatalities and injuries. Of these, excessive heat, tornadoes and flash floods are the top three sources of fatalities, accounting for almost half of those reported, while tornadoes, floods and excessive heat are the top three sources of injuries. Tornadoes alone account for more than one third of all reported injuries. Analysis of the economic consequences of severe weather events using the same data shows that floods, hurricanes/typhoons, storm surges and tornadoes cause the greatest economic loss. Droughts cause the most crop damage, while floods cause the most property damage.
This analysis was performed as part of the coursework for the Data Science: Foundations Using R Specialization certification. Using data from the National Oceanic and Atmospheric Administration (NOAA) official storm events database, I explored two questions per the requirements of the assignment: which event types are most harmful to human health, and which have the greatest economic consequences.
The data used in this analysis was retrieved from the assignment instructions page. Documentation for the data can be found in the National Weather Service Storm Data Documentation and on the National Climatic Data Center FAQ page.
dataUrl <- "https://d396qusza40orc.cloudfront.net/repdata/data/StormData.csv.bz2"
if (! file.exists("./stormData.csv.bz2")) {
download.file(dataUrl, 'stormData.csv.bz2', mode="wb")
}
stormData <- read.csv("stormData.csv.bz2")
dim(stormData)
## [1] 902297 37
colnames(stormData)
## [1] "STATE__" "BGN_DATE" "BGN_TIME" "TIME_ZONE" "COUNTY"
## [6] "COUNTYNAME" "STATE" "EVTYPE" "BGN_RANGE" "BGN_AZI"
## [11] "BGN_LOCATI" "END_DATE" "END_TIME" "COUNTY_END" "COUNTYENDN"
## [16] "END_RANGE" "END_AZI" "END_LOCATI" "LENGTH" "WIDTH"
## [21] "F" "MAG" "FATALITIES" "INJURIES" "PROPDMG"
## [26] "PROPDMGEXP" "CROPDMG" "CROPDMGEXP" "WFO" "STATEOFFIC"
## [31] "ZONENAMES" "LATITUDE" "LONGITUDE" "LATITUDE_E" "LONGITUDE_"
## [36] "REMARKS" "REFNUM"
sd <- as.Date(stormData[, "BGN_DATE"], "%m/%d/%Y")
sd <- sort(sd)
head(sd, 4) # first row corresponds to 01/03/1950
## [1] "1950-01-03" "1950-01-03" "1950-01-03" "1950-01-03"
tail(sd, 4) # last row corresponds to 11/30/2011
## [1] "2011-11-30" "2011-11-30" "2011-11-30" "2011-11-30"
unique(stormData$STATE) # shows that all states and territories are covered
## [1] "AL" "AZ" "AR" "CA" "CO" "CT" "DE" "DC" "FL" "GA" "HI" "ID" "IL" "IN" "IA"
## [16] "KS" "KY" "LA" "ME" "MD" "MA" "MI" "MN" "MS" "MO" "MT" "NE" "NV" "NH" "NJ"
## [31] "NM" "NY" "NC" "ND" "OH" "OK" "OR" "PA" "RI" "SC" "SD" "TN" "TX" "UT" "VT"
## [46] "VA" "WA" "WV" "WI" "WY" "PR" "AK" "ST" "AS" "GU" "MH" "VI" "AM" "LC" "PH"
## [61] "GM" "PZ" "AN" "LH" "LM" "LE" "LS" "SL" "LO" "PM" "PK" "XX"
The raw data set retrieved from the source consists of 902,297 observations, with event dates from January 1950 to November 2011, and covers all US states and territories.
Here is a short description of columns relevant for the current analysis:
| Column | Type | Description |
|---|---|---|
| BGN_DATE | chr | Date an event started, in format ‘month/day/year 00:00:00’, 1/3/1950 to 11/30/2011 |
| STATE | chr | State where the event occurred, 2-letter abbreviation for US states and territories, 72 entries total. |
| EVTYPE | chr | Type of the event, there are a total of 958 entries for this column |
| FATALITIES | num | Fatalities associated with the recorded event |
| INJURIES | num | Injuries associated with the recorded event |
| PROPDMG | num | Part one of the property damage data for an event |
| PROPDMGEXP | chr | Holds codes like ‘K’, ‘M’, ‘B’ for thousands, millions etc. With PROPDMG, provides the measure of property damages in USD. |
| CROPDMG | num | Part one of the crop damage data for an event |
| CROPDMGEXP | chr | Holds codes like ‘K’, ‘M’, ‘B’ for thousands, millions etc. With CROPDMG, provides the measure of crop damages in USD. |
The data was processed as follows to make it suitable for analysis:
1. The current analysis focuses on the impact on human health and the economy, so all columns not relevant to it were dropped.
2. Column BGN_DATE was converted to type Date.
3. A numeric column ‘year’ was added to facilitate analysis.
4. The data was filtered to keep only events from 1996 onwards. The reason for excluding pre-1996 data: during the exploratory phase, the total number of events per year was plotted against the year. This plot showed a large discrepancy between the number of events reported per year before 1996 and from 1996 onwards. The most plausible explanation for this discrepancy is a difference in reporting, which raises the possibility that the analysis would be inaccurate if data from all years were used. To mitigate this risk, only data collected from 1996 onwards was used.
5. Column PROPDMG was repopulated with values calculated from PROPDMG and PROPDMGEXP.
6. Column CROPDMG was repopulated with values calculated from CROPDMG and CROPDMGEXP.
7. Columns PROPDMGEXP and CROPDMGEXP were removed, as they are no longer needed.
8. Event type strings were cleaned up. Several entries in the EVTYPE column were lower case or had trailing spaces and, as a result, were treated as separate event types. EVTYPE values were trimmed and uppercased to avoid this.
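The decision to exclude pre-1996 data rests on comparing the number of events reported per year. The exploratory plot itself is not shown, so the following is only a minimal sketch of how such a count could be produced in base R. The `bgn_date` vector holds made-up dates in the same format as BGN_DATE, not actual NOAA records, and the names `bgn_date`, `event_year` and `events_per_year` are illustrative.

```r
# Toy stand-in for the BGN_DATE column (hypothetical values, not NOAA data)
bgn_date <- c("1/3/1950 0:00:00", "6/8/1995 0:00:00",
              "1/6/1996 0:00:00", "1/11/1996 0:00:00",
              "11/30/2011 0:00:00")

# Parse the date part (the trailing time is ignored) and extract a numeric year
event_year <- as.numeric(format(as.Date(bgn_date, "%m/%d/%Y"), "%Y"))

# Count events per year; table() returns one count per distinct year
events_per_year <- table(event_year)

# On the full data set, a sudden jump in bar height would point to a reporting change
barplot(events_per_year, xlab = "Year", ylab = "Events reported")
```

Applied to the real data, the same steps would start from `stormData$BGN_DATE` instead of the toy vector.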
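# packages used below: dplyr (select, group_by, summarise, arrange, %>%),
# stringr (str_trim) and knitr (kable)
library(dplyr)
library(stringr)
library(knitr)
# step 1
fd <- stormData[, c("BGN_DATE", "COUNTY", "STATE", "EVTYPE", "FATALITIES", "INJURIES", "PROPDMG", "PROPDMGEXP", "CROPDMG", "CROPDMGEXP")]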
# steps 2 & 3
fd$BGN_DATE <- as.Date(fd$BGN_DATE, "%m/%d/%Y")
# store the year as a number so the filter below is a numeric, not a string, comparison
fd$year <- as.numeric(format(fd$BGN_DATE, "%Y"))
# step 4
fd <- subset(fd, year >= 1996)
# step 5, 6 & 7
unique(fd$PROPDMGEXP)
## [1] "K" "" "M" "B" "0"
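# Scale a damage value by the power of ten encoded in its exponent code
convertVal <- function(val, exp) {
  if (!is.na(val) && val > 0) {
    conv <- c(H=2, K=3, M=6, B=9)
    # assumption: blank or unrecognised codes ("" or "0") leave the value unchanged (10^0 = 1)
    expn <- if (exp %in% names(conv)) conv[[exp]] else 0
    val <- val * (10 ^ expn)
  }
  val
}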
fd$PROPDMG <- mapply(convertVal, fd$PROPDMG, fd$PROPDMGEXP)
fd$CROPDMG <- mapply(convertVal, fd$CROPDMG, fd$CROPDMGEXP)
fd <- select(fd, -c(PROPDMGEXP, CROPDMGEXP))
colnames(fd)
## [1] "BGN_DATE" "COUNTY" "STATE" "EVTYPE" "FATALITIES"
## [6] "INJURIES" "PROPDMG" "CROPDMG" "year"
# step 8
length(unique(fd$EVTYPE))
## [1] 516
fd$EVTYPE <- toupper(str_trim(fd$EVTYPE))
length(unique(fd$EVTYPE)) # reduced event types from 516 to 430
## [1] 430
dim(fd)
## [1] 653530 9
kable(head(fd, 3))
| | BGN_DATE | COUNTY | STATE | EVTYPE | FATALITIES | INJURIES | PROPDMG | CROPDMG | year |
|---|---|---|---|---|---|---|---|---|---|
| 248768 | 1996-01-06 | 1 | AL | WINTER STORM | 0 | 0 | 380000 | 38000 | 1996 |
| 248769 | 1996-01-11 | 31 | AL | TORNADO | 0 | 0 | 100000 | 0 | 1996 |
| 248770 | 1996-01-11 | 31 | AL | TSTM WIND | 0 | 0 | 3000 | 0 | 1996 |
kable(tail(fd, 3))
| | BGN_DATE | COUNTY | STATE | EVTYPE | FATALITIES | INJURIES | PROPDMG | CROPDMG | year |
|---|---|---|---|---|---|---|---|---|---|
| 902295 | 2011-11-08 | 213 | AK | HIGH WIND | 0 | 0 | 0 | 0 | 2011 |
| 902296 | 2011-11-09 | 202 | AK | BLIZZARD | 0 | 0 | 0 | 0 | 2011 |
| 902297 | 2011-11-28 | 6 | AL | HEAVY SNOW | 0 | 0 | 0 | 0 | 2011 |
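Calling a scalar function once per row through `mapply` can be slow on several hundred thousand rows. As an alternative, the exponent codes can be mapped to powers of ten for the whole column at once. The sketch below is one possible way to do this. The helper `scale_damage` is hypothetical and not part of the original analysis, and it assumes that blank or unrecognised codes mean a multiplier of 1, which is an interpretation rather than something the data documentation is quoted as stating here.

```r
# Hypothetical helper, not part of the original analysis
scale_damage <- function(val, exp) {
  # Powers of ten for the recognised exponent codes
  powers <- c(H = 2, K = 3, M = 6, B = 9)
  # Look up every code at once; codes without a match come back as NA
  expn <- unname(powers[toupper(exp)])
  # Assumption: blank or unrecognised codes leave the value unchanged (10^0 = 1)
  expn[is.na(expn)] <- 0
  val * 10^expn
}

# Toy values: 380K, 2.5M, 1B, a blank code and a "0" code
scale_damage(c(380, 2.5, 1, 7, 0), c("K", "M", "B", "", "0"))
```

With the column names used in this report, a call would look like `scale_damage(fd$PROPDMG, fd$PROPDMGEXP)`, run before the exponent columns are dropped.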
To find the event types that are the most harmful in terms of human life, cumulative fatalities and injuries from 1996 through 2011 were plotted by event type.
fd$casualties <- fd$FATALITIES + fd$INJURIES
esum <- fd %>% group_by(EVTYPE) %>%
summarise(fatalities=sum(FATALITIES), injuries=sum(INJURIES), casualties=sum(casualties), events=n())
# top 10 per measure, reversed so the largest bar is drawn at the top of each horizontal plot
f <- arrange(esum, desc(fatalities))[10:1, ]
i <- arrange(esum, desc(injuries))[10:1, ]
cas <- arrange(esum, desc(casualties))[10:1, ] # named 'cas' to avoid masking base::c
par(mfrow=c(1,3), mar=c(2,8,4,5), oma=c(3,10,8,3), las=1)
barplot(f$fatalities, names.arg=f$EVTYPE, cex.names=1.12, font = 2, horiz=TRUE, main="Fatalities", cex.main=1.5, cex.axis=1.3, las=1)
barplot(i$injuries, names.arg=i$EVTYPE, cex.names=1.1, font = 2, horiz=TRUE, main="Injuries", cex.main=1.5, cex.axis=1.3)
barplot(cas$casualties, names.arg=cas$EVTYPE, cex.names=1.1, font = 2, horiz=TRUE, main="All casualties", cex.main=1.5, cex.axis=1.3)
title("Fig.1 - Weather event types that cause the most human casualties", outer=TRUE, cex.main=2.5)
mtext(side=3, "cumulative data from 1996-2011", outer=TRUE, cex=1.3, font=2)
box(which="outer", lty="solid")
Figure 1: Cumulative fatalities and injuries from 1996 through 2011 by event type. The left plot shows the 10 event types with the most cumulative fatalities. The center plot shows the 10 event types with the most cumulative injuries. The right plot shows the 10 event types with the most fatalities and injuries combined.
Based on the above plots and data analysis, we can conclude that:
The top 5 contributors to fatalities are EXCESSIVE HEAT, TORNADO, FLASH FLOOD, LIGHTNING, FLOOD.
The top 5 contributors to injuries are TORNADO, FLOOD, EXCESSIVE HEAT, LIGHTNING, TSTM WIND.
The top 5 contributors to overall harm to human health are TORNADO, EXCESSIVE HEAT, FLOOD, LIGHTNING, TSTM WIND.
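Statements such as "the top three event types account for almost half of the fatalities" come down to dividing the sum of the largest per-type totals by the grand total. The sketch below shows that arithmetic in base R. The helper `top_share` and the `toy` data frame are illustrative assumptions with made-up numbers, not NOAA figures and not the code used to produce the results above.

```r
# Hypothetical helper: share of the grand total contributed by the n largest groups
top_share <- function(values, groups, n = 3) {
  # Sum the values per group, then order the group totals from largest to smallest
  totals <- sort(tapply(values, groups, sum), decreasing = TRUE)
  sum(head(totals, n)) / sum(totals)
}

# Toy data (made-up numbers, not NOAA figures)
toy <- data.frame(
  EVTYPE = c("TORNADO", "TORNADO", "FLOOD", "HAIL", "LIGHTNING"),
  FATALITIES = c(5, 3, 4, 1, 7)
)

# Per-type totals are TORNADO 8, LIGHTNING 7, FLOOD 4, HAIL 1, so the top two hold 15 of 20
top_share(toy$FATALITIES, toy$EVTYPE, n = 2)
```

On the processed data, the equivalent call would be `top_share(fd$FATALITIES, fd$EVTYPE, 3)` for fatalities or `top_share(fd$INJURIES, fd$EVTYPE, 1)` for the single largest source of injuries.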
fd$econoDmg <- fd$PROPDMG + fd$CROPDMG
esum <- fd %>% group_by(EVTYPE) %>%
summarise(propertyDmg=sum(PROPDMG), cropDmg=sum(CROPDMG), econoDmg=sum(econoDmg), events=n())
p <- arrange(esum, desc(propertyDmg))[10:1, ]
cr <- arrange(esum, desc(cropDmg))[10:1, ]
e <- arrange(esum, desc(econoDmg))[10:1, ]
par(mfrow=c(1,3), mar=c(4,6,4,5), oma=c(3,10,8,3), las=1)
barplot(p$propertyDmg*(10^-9), names.arg=p$EVTYPE, cex.names=1.12, font=2, horiz=TRUE, main="Property Damage",
xlab="USD, billions", cex.main=1.5, cex.lab=1.5, cex.axis=1.3, las=1)
barplot(cr$cropDmg*(10^-9), names.arg=cr$EVTYPE, cex.names=1.1, font=2, horiz=TRUE, main="Crop Damage",
xlab="USD, billions", cex.main=1.5,cex.lab=1.5, cex.axis=1.3)
barplot(e$econoDmg*(10^-9), names.arg=e$EVTYPE, cex.names=1.1, font=2, horiz=TRUE, main="All damages",
xlab="USD, billions", cex.main=1.5, cex.lab=1.5, cex.axis=1.3)
title("Fig.2 - Weather event types that cause the most negative economic consequences", outer=TRUE, cex.main=2.5)
mtext(side=3, "cumulative data from 1996-2011", outer=TRUE, cex=1.5, font=2)
box(which="outer", lty="solid")
Figure 2: Cumulative (1996 through 2011) property damage and crop damage caused by weather event types. The left plot shows the 10 event types causing the most property damage. The center plot shows the 10 event types causing the most crop damage. The right plot shows the 10 event types causing the most property and crop damage combined.
Based on the above plots and data analysis, we can conclude that:
The top 5 contributors to property damage are FLOOD, HURRICANE/TYPHOON, STORM SURGE, TORNADO, FLASH FLOOD.
The top 5 contributors to crop damage are DROUGHT, FLOOD, HURRICANE, HURRICANE/TYPHOON, HAIL.
The top 5 contributors to economic cost are FLOOD, HURRICANE/TYPHOON, STORM SURGE, TORNADO, HAIL.