The U.S. National Oceanic and Atmospheric Administration (NOAA) maintains a database which that tracks major storm and weather events in the United States. This project attempts to analyze this database for impact on human health and economy due to these events. The database provides detailed information about the event, location and its impact on human life and economy.
It can be summarized that tornadoes and excessive heat have the greatest impact on population fatalities and injuries, while flood and hurrican/typhoon have the greatest impact on property and cost damage in the U.S. in the approximately 45 years that the data was collected.
This section discusses the journey of data from its source through to the final preparation before generating results.
Database source URL: https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2
The data file pointed to by the URL was downloaded and unzipped on the local Mac (OSX 10.11.16) on 07/20/2017
library(dplyr)
library (ggplot2)
library(lattice)
library(grid)
library(gridExtra)
stormdbFile <- "repdata%2Fdata%2FStormData.csv"
rawDF <- read.csv(stormdbFile)
rawDF <- tbl_df(rawDF)
The csv data is properly delimited for reading and a heading row has variable names referenced in the NOAA codebook.
The data file is quite large with 900K observations for 37 variables beginning in the year 1966 and ending in 2011 as evidenced by the outputs below.
str(rawDF)
## Classes 'tbl_df', 'tbl' and 'data.frame': 902297 obs. of 37 variables:
## $ STATE__ : num 1 1 1 1 1 1 1 1 1 1 ...
## $ BGN_DATE : Factor w/ 16335 levels "1/1/1966 0:00:00",..: 6523 6523 4242 11116 2224 2224 2260 383 3980 3980 ...
## $ BGN_TIME : Factor w/ 3608 levels "00:00:00 AM",..: 272 287 2705 1683 2584 3186 242 1683 3186 3186 ...
## $ TIME_ZONE : Factor w/ 22 levels "ADT","AKS","AST",..: 7 7 7 7 7 7 7 7 7 7 ...
## $ COUNTY : num 97 3 57 89 43 77 9 123 125 57 ...
## $ COUNTYNAME: Factor w/ 29601 levels "","5NM E OF MACKINAC BRIDGE TO PRESQUE ISLE LT MI",..: 13513 1873 4598 10592 4372 10094 1973 23873 24418 4598 ...
## $ STATE : Factor w/ 72 levels "AK","AL","AM",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ EVTYPE : Factor w/ 985 levels " HIGH SURF ADVISORY",..: 834 834 834 834 834 834 834 834 834 834 ...
## $ BGN_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ BGN_AZI : Factor w/ 35 levels ""," N"," NW",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ BGN_LOCATI: Factor w/ 54429 levels ""," Christiansburg",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ END_DATE : Factor w/ 6663 levels "","1/1/1993 0:00:00",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ END_TIME : Factor w/ 3647 levels ""," 0900CST",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ COUNTY_END: num 0 0 0 0 0 0 0 0 0 0 ...
## $ COUNTYENDN: logi NA NA NA NA NA NA ...
## $ END_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ END_AZI : Factor w/ 24 levels "","E","ENE","ESE",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ END_LOCATI: Factor w/ 34506 levels ""," CANTON"," TULIA",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ LENGTH : num 14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
## $ WIDTH : num 100 150 123 100 150 177 33 33 100 100 ...
## $ F : int 3 2 2 2 2 2 2 1 3 3 ...
## $ MAG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ FATALITIES: num 0 0 0 0 0 0 0 0 1 0 ...
## $ INJURIES : num 15 0 2 2 2 6 1 0 14 0 ...
## $ PROPDMG : num 25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
## $ PROPDMGEXP: Factor w/ 19 levels "","-","?","+",..: 17 17 17 17 17 17 17 17 17 17 ...
## $ CROPDMG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CROPDMGEXP: Factor w/ 9 levels "","?","0","2",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ WFO : Factor w/ 542 levels ""," CI","%SD",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ STATEOFFIC: Factor w/ 250 levels "","ALABAMA, Central",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ ZONENAMES : Factor w/ 25112 levels ""," "| __truncated__,..: 1 1 1 1 1 1 1 1 1 1 ...
## $ LATITUDE : num 3040 3042 3340 3458 3412 ...
## $ LONGITUDE : num 8812 8755 8742 8626 8642 ...
## $ LATITUDE_E: num 3051 0 0 0 0 ...
## $ LONGITUDE_: num 8806 0 0 0 0 ...
## $ REMARKS : Factor w/ 436781 levels "","\t","\t\t",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ REFNUM : num 1 2 3 4 5 6 7 8 9 10 ...
tail(rawDF)
## # A tibble: 6 Ă— 37
## STATE__ BGN_DATE BGN_TIME TIME_ZONE COUNTY
## <dbl> <fctr> <fctr> <fctr> <dbl>
## 1 47 11/28/2011 0:00:00 03:00:00 PM CST 21
## 2 56 11/30/2011 0:00:00 10:30:00 PM MST 7
## 3 30 11/10/2011 0:00:00 02:48:00 PM MST 9
## 4 2 11/8/2011 0:00:00 02:58:00 PM AKS 213
## 5 2 11/9/2011 0:00:00 10:21:00 AM AKS 202
## 6 1 11/28/2011 0:00:00 08:00:00 PM CST 6
## # ... with 32 more variables: COUNTYNAME <fctr>, STATE <fctr>,
## # EVTYPE <fctr>, BGN_RANGE <dbl>, BGN_AZI <fctr>, BGN_LOCATI <fctr>,
## # END_DATE <fctr>, END_TIME <fctr>, COUNTY_END <dbl>, COUNTYENDN <lgl>,
## # END_RANGE <dbl>, END_AZI <fctr>, END_LOCATI <fctr>, LENGTH <dbl>,
## # WIDTH <dbl>, F <int>, MAG <dbl>, FATALITIES <dbl>, INJURIES <dbl>,
## # PROPDMG <dbl>, PROPDMGEXP <fctr>, CROPDMG <dbl>, CROPDMGEXP <fctr>,
## # WFO <fctr>, STATEOFFIC <fctr>, ZONENAMES <fctr>, LATITUDE <dbl>,
## # LONGITUDE <dbl>, LATITUDE_E <dbl>, LONGITUDE_ <dbl>, REMARKS <fctr>,
## # REFNUM <dbl>
For analyzing population impact, this project simply uses EVTYPE, FATALITIES and INJURIES. FATALITIES and INJURIES are aggregated for all the severe weather events.
For analyzing economic impact, this project uses EVTYPE, PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP. The damage to property and crops is aggregated for all the weather events.
In either case, no attempt is being made here to analyze data per county or by calendar year or latitude/longtitude.
popDF.raw <- rawDF %>%
select(EVTYPE,FATALITIES,INJURIES) %>%
group_by(EVTYPE) %>%
summarize(TOTAL.FATALITIES = sum(FATALITIES), TOTAL.INJURIES = sum(INJURIES)) %>%
arrange(desc(TOTAL.FATALITIES), desc(TOTAL.INJURIES))
ecoDF.raw <- select(rawDF, EVTYPE, PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP)
sumFatalities <- sum(popDF.raw$TOTAL.FATALITIES)
sumInjuries <- sum(popDF.raw$TOTAL.INJURIES)
The population data is summarized and sorted to highlight an issue with data cleanliness as discussed below.
In popDF.raw as obtained above, the EVTYPE variable is not very distinct for some events. For example, let’s look at the top 25 entries below.
print(popDF.raw, n=25)
## # A tibble: 985 Ă— 3
## EVTYPE TOTAL.FATALITIES TOTAL.INJURIES
## <fctr> <dbl> <dbl>
## 1 TORNADO 5633 91346
## 2 EXCESSIVE HEAT 1903 6525
## 3 FLASH FLOOD 978 1777
## 4 HEAT 937 2100
## 5 LIGHTNING 816 5230
## 6 TSTM WIND 504 6957
## 7 FLOOD 470 6789
## 8 RIP CURRENT 368 232
## 9 HIGH WIND 248 1137
## 10 AVALANCHE 224 170
## 11 WINTER STORM 206 1321
## 12 RIP CURRENTS 204 297
## 13 HEAT WAVE 172 309
## 14 EXTREME COLD 160 231
## 15 THUNDERSTORM WIND 133 1488
## 16 HEAVY SNOW 127 1021
## 17 EXTREME COLD/WIND CHILL 125 24
## 18 STRONG WIND 103 280
## 19 BLIZZARD 101 805
## 20 HIGH SURF 101 152
## 21 HEAVY RAIN 98 251
## 22 EXTREME HEAT 96 155
## 23 COLD/WIND CHILL 95 12
## 24 ICE STORM 89 1975
## 25 WILDFIRE 75 911
## # ... with 960 more rows
The NOAA documentation does not provide clarity in the difference between “HEAT” “EXCESSIVE HEAT” and “EXTREME HEAT”. Therefore, it may be more meaningful to combine some of these events as they appear near the top of our impact list. Some like “WIND” and “WINDS” seem like typos. Some others like “TSTM” and “THUNDERSTORM” mean the same thing but are inconsistently labeled.
These issues have been fixed by relabeling EVTYPE as shown below.
popDF <- popDF.raw %>%
mutate( EVTYPE = gsub("TSTM", "THUNDERSTORM", EVTYPE) ) %>%
mutate( EVTYPE = gsub("WINDS", "WIND", EVTYPE) ) %>%
mutate( EVTYPE = gsub("^WIND$", "HIGH WIND", EVTYPE) ) %>%
mutate( EVTYPE = gsub("STRONG HIGH WIND", "HIGH WIND", EVTYPE) ) %>%
mutate( EVTYPE = gsub("STRONG WIND", "HIGH WIND", EVTYPE) ) %>%
mutate( EVTYPE = gsub("^HEAT$", "EXCESSIVE HEAT", EVTYPE) ) %>%
mutate( EVTYPE = gsub("EXTREME HEAT", "EXCESSIVE HEAT", EVTYPE) ) %>%
mutate( EVTYPE = gsub("RECORD/EXCESSIVE HEAT", "EXCESSIVE HEAT", EVTYPE) ) %>%
mutate( EVTYPE = gsub("HEAT WAVES", "HEAT WAVE", EVTYPE) ) %>%
mutate( EVTYPE = gsub("FLOODING", "FLOOD", EVTYPE) ) %>%
mutate( EVTYPE = gsub("FLOODS", "FLOOD", EVTYPE) ) %>%
mutate( EVTYPE = gsub("FLOOD/FLASH FLOOD", "FLASH FLOOD/FLOOD", EVTYPE) ) %>%
mutate( EVTYPE = gsub("CURRENTS", "CURRENT", EVTYPE) ) %>%
mutate( EVTYPE = gsub("DENSE FOG", "FOG", EVTYPE) ) %>%
mutate( EVTYPE = gsub("^HIGH SURF$", "HEAVY SURF/HIGH SURF", EVTYPE) ) %>%
mutate( EVTYPE = gsub("^HEAVY SURF$", "HEAVY SURF/HIGH SURF", EVTYPE) ) %>%
## Combine and re-sort all the events again
group_by(EVTYPE) %>%
summarize(TOTAL.FATALITIES = sum(TOTAL.FATALITIES), TOTAL.INJURIES = sum(TOTAL.INJURIES),
FPCT = TOTAL.FATALITIES/sumFatalities, IPCT = TOTAL.INJURIES/sumInjuries)
The last lines of the code re-aggregate the newly labeled data.
The PROPDMGEXP variable has several invalid values as shown below.
unique(ecoDF.raw["PROPDMGEXP"])
## # A tibble: 19 Ă— 1
## PROPDMGEXP
## <fctr>
## 1 K
## 2 M
## 3
## 4 B
## 5 m
## 6 +
## 7 0
## 8 5
## 9 6
## 10 ?
## 11 4
## 12 2
## 13 3
## 14 h
## 15 7
## 16 H
## 17 -
## 18 1
## 19 8
The only valid values are k, K, b, B, m or M. Invalid values like numbers etc. need to be imputed. We will arbitarily assign a “K” or a thousand dollar multiplier in such cases. This is assuming that if the economic impact would have been large, the data entry and review at NOAA would have been more thorough.
Although not explicitly shown, a similar issue exists for CROPDMGEXP, so we will impute that variable as well.
Both are resolved as shown below.
ecoDF.temp1 <- ecoDF.raw %>%
mutate(PROPDMGEXP = gsub("[0-9?hH\\+\\-]", "k", PROPDMGEXP)) %>%
mutate(CROPDMGEXP = gsub("[0-9?]", "k", CROPDMGEXP))
The next step would be to convert the PROPDMGEXP and CROPDMGEXP into multipliers for thousands, millions and billions
ecoDF.temp2 <- ecoDF.temp1 %>%
mutate(PROPMULTIPLIER = ifelse(grepl("[bB]",PROPDMGEXP), 10^9, 10^3)) %>%
mutate(CROPMULTIPLIER = ifelse(grepl("[bB]",CROPDMGEXP), 10^9, 10^3)) %>%
mutate(PROPMULTIPLIER = ifelse(grepl("[mM]",PROPDMGEXP), 10^6, PROPMULTIPLIER)) %>%
mutate(CROPMULTIPLIER = ifelse(grepl("[mM]",CROPDMGEXP), 10^6, CROPMULTIPLIER)) %>%
mutate(PROPCOST = PROPMULTIPLIER*PROPDMG) %>%
mutate(CROPCOST = CROPMULTIPLIER*CROPDMG) %>%
mutate(COST = CROPMULTIPLIER*CROPDMG + PROPMULTIPLIER*PROPDMG)
Finally, we can now aggregate and sort the COST for the different EVTYPEs.
ecoDF <- ecoDF.temp2 %>%
group_by(EVTYPE) %>%
summarize(TOTAL.COST = round(sum(COST)/10^6),
PROP.COST = round(sum(PROPCOST)/10^6),
CROP.COST = round(sum(CROPCOST)/10^6)) %>%
arrange(desc(TOTAL.COST))
The following step is not absolutely necessary but will help prettify the results graph and table with cost % annotations.
sumCOST <- sum(ecoDF$TOTAL.COST)
ecoDF <- ecoDF %>%
mutate(COSTPCT = TOTAL.COST/sumCOST) %>%
mutate(CROPPCT = round(CROP.COST/TOTAL.COST,2))
We are now ready to plot data and draw conclusions.
The popDF data has close to 1000 event types, however the impact of several recorded events is negligible given the 45 year long time period. The impact drops off precipitously as seen below, so only use the top 20 entries for FATALITIES and INJURIES in our plots.
top_FATALITIES <- arrange(popDF, desc(TOTAL.FATALITIES)) [1:20,]
top_INJURIES <- arrange(popDF, desc(TOTAL.INJURIES)) [1:20,]
pF <- ggplot(data=top_FATALITIES, aes(x=EVTYPE,y=TOTAL.FATALITIES)) +
geom_bar(stat="identity", fill = "#00FF00") +
scale_x_discrete(limits = top_FATALITIES$EVTYPE) +
geom_text(aes(label = scales::percent(FPCT)), size=3, hjust="inward", color = "black") +
coord_flip()
pI <- ggplot(data=top_INJURIES, aes(x=EVTYPE,y=TOTAL.INJURIES)) +
geom_bar(stat="identity", fill = "#00DDFF") +
scale_x_discrete(limits = top_INJURIES$EVTYPE) +
geom_text(aes(label = scales::percent(IPCT)), size=3, hjust="inward", color = "black") +
coord_flip()
grid.arrange(pF, pI, ncol=2, top = textGrob("Impact Of Weather Events On Population",
gp = gpar(fontsize = 20, font = 3)))
The bar plot on the left indicates human fatalities and the plot on the right shows injuries.
In summary, TORNADOES impact human population the most and the top 5 causes for deaths and injuries are the same even if differently ordered. Finally, NOAA might review the EVTYPE categorization for typos, duplicates and removing irrelevant categorizations.
We can plot the economic cost of the top 25 weather events in a similar manner.
top_eco <- ecoDF [1:20,]
pF <- ggplot(data=top_eco, aes(x=EVTYPE,y=TOTAL.COST)) +
geom_bar(stat="identity", fill = "#00FF00") +
scale_x_discrete(limits = top_eco$EVTYPE) +
labs(y = "COST in Millions") +
labs(title = "Economic Impact Of Weather Events") +
geom_text(aes(label = scales::percent(COSTPCT)), size=3, hjust="inward", color = "black") +
coord_flip()
pF
We can take a quick peek at the top_eco table where the crop damage percentage is shown int he last column.
print(arrange(top_eco[,c(1:2,6)],desc(CROPPCT)),n=10)
## # A tibble: 20 Ă— 3
## EVTYPE TOTAL.COST CROPPCT
## <fctr> <dbl> <dbl>
## 1 DROUGHT 15019 0.93
## 2 ICE STORM 8967 0.56
## 3 RIVER FLOOD 10148 0.50
## 4 HURRICANE 14610 0.19
## 5 HAIL 18759 0.16
## 6 HIGH WIND 5909 0.11
## 7 TSTM WIND 5039 0.11
## 8 THUNDERSTORM WIND 3898 0.11
## 9 FLASH FLOOD 17563 0.08
## 10 TROPICAL STORM 8382 0.08
## # ... with 10 more rows
The table above shows that damage to property cost dominates in most cases. The notable deviation, not surprisingly, is DROUGHT where crop damage accounts for more than 90% of the overall damage.
This concludes our analysis of the NOAA database.