Analysis of the NOAA Storm Database to Identify Weather Events Most Harmful to Population Health and Those with the Greatest Economic Consequences

Project Summary

Storms and other severe weather events can cause both public health and economic problems for communities and municipalities. Many severe events can result in fatalities, injuries, and property damage, and preventing such outcomes to the extent possible is a key concern. This project involves exploring the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. This database tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage. The purpose of the analysis is to identify severe weather events that are most harmful to population health (measured by fatalities and injuries) and those with the greatest economic consequences (measured by property and crop damage). The results of this analysis can be used by government officials to prepare for severe weather events and to prioritize resources for different types of events.

Data Processing

Step 1: Load packages and read in the data using read.csv

setwd("C:/Users/yxj4/OneDrive - CDC/+My_Documents/CDC/1 DDT/Data Modernization/Coursera/Hopkins Data Science Specialization/5 Reproducible Research/4 Course Project 2/final_project")
library(Hmisc)
## Loading required package: lattice
## Loading required package: survival
## Loading required package: Formula
## Loading required package: ggplot2
## 
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:base':
## 
##     format.pval, units
noaa <- read.csv("repdata_data_StormData.csv.bz2", header = TRUE, sep = ",", na.strings = "")

Step 2: Explore the data

str(noaa)
## 'data.frame':    902297 obs. of  37 variables:
##  $ STATE__   : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ BGN_DATE  : chr  "4/18/1950 0:00:00" "4/18/1950 0:00:00" "2/20/1951 0:00:00" "6/8/1951 0:00:00" ...
##  $ BGN_TIME  : chr  "0130" "0145" "1600" "0900" ...
##  $ TIME_ZONE : chr  "CST" "CST" "CST" "CST" ...
##  $ COUNTY    : num  97 3 57 89 43 77 9 123 125 57 ...
##  $ COUNTYNAME: chr  "MOBILE" "BALDWIN" "FAYETTE" "MADISON" ...
##  $ STATE     : chr  "AL" "AL" "AL" "AL" ...
##  $ EVTYPE    : chr  "TORNADO" "TORNADO" "TORNADO" "TORNADO" ...
##  $ BGN_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ BGN_AZI   : chr  NA NA NA NA ...
##  $ BGN_LOCATI: chr  NA NA NA NA ...
##  $ END_DATE  : chr  NA NA NA NA ...
##  $ END_TIME  : chr  NA NA NA NA ...
##  $ COUNTY_END: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ COUNTYENDN: logi  NA NA NA NA NA NA ...
##  $ END_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ END_AZI   : chr  NA NA NA NA ...
##  $ END_LOCATI: chr  NA NA NA NA ...
##  $ LENGTH    : num  14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
##  $ WIDTH     : num  100 150 123 100 150 177 33 33 100 100 ...
##  $ F         : int  3 2 2 2 2 2 2 1 3 3 ...
##  $ MAG       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ FATALITIES: num  0 0 0 0 0 0 0 0 1 0 ...
##  $ INJURIES  : num  15 0 2 2 2 6 1 0 14 0 ...
##  $ PROPDMG   : num  25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
##  $ PROPDMGEXP: chr  "K" "K" "K" "K" ...
##  $ CROPDMG   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CROPDMGEXP: chr  NA NA NA NA ...
##  $ WFO       : chr  NA NA NA NA ...
##  $ STATEOFFIC: chr  NA NA NA NA ...
##  $ ZONENAMES : chr  NA NA NA NA ...
##  $ LATITUDE  : num  3040 3042 3340 3458 3412 ...
##  $ LONGITUDE : num  8812 8755 8742 8626 8642 ...
##  $ LATITUDE_E: num  3051 0 0 0 0 ...
##  $ LONGITUDE_: num  8806 0 0 0 0 ...
##  $ REMARKS   : chr  NA NA NA NA ...
##  $ REFNUM    : num  1 2 3 4 5 6 7 8 9 10 ...
describe(noaa$EVTYPE)
## noaa$EVTYPE 
##        n  missing distinct 
##   902297        0      985 
## 
## lowest :    HIGH SURF ADVISORY  COASTAL FLOOD         FLASH FLOOD           LIGHTNING             TSTM WIND           
## highest: WINTERY MIX           Wintry mix            Wintry Mix            WINTRY MIX            WND
describe(noaa$FATALITIES)
## noaa$FATALITIES 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##   902297        0       52    0.023  0.01678  0.03344        0        0 
##      .25      .50      .75      .90      .95 
##        0        0        0        0        0 
## 
## lowest :   0   1   2   3   4, highest:  99 114 116 158 583

Step 3: Create a smaller data set by only keeping the necessary variables

noaa2 <- noaa[,c("EVTYPE", "FATALITIES", "INJURIES", "PROPDMG", "PROPDMGEXP", "CROPDMG", "CROPDMGEXP", "BGN_DATE")]

Step 4: Create a data set that only contains data from 1996 to 2011

The events in the database start in the year 1950 and end in November 2011. In the earlier years of the database there are generally fewer events recorded, most likely due to a lack of good records. According to this website, NOAA started recording all severe weather events beginning in 1996: https://www.ncdc.noaa.gov/stormevents/details.jsp. Therefore, in order to provide a fair comparison between all weather event types in terms of the consequences for public health and the economy, we should only evaluate events that happened in 1996 and after.

range(noaa2$BGN_DATE)
## [1] "1/1/1966 0:00:00" "9/9/2011 0:00:00"
noaa2$year <- format(as.Date(noaa2$BGN_DATE, format="%m/%d/%Y %H:%M:%S"),"%Y")
range(noaa2$year)
## [1] "1950" "2011"
table(noaa2$year)
## 
##  1950  1951  1952  1953  1954  1955  1956  1957  1958  1959  1960  1961  1962 
##   223   269   272   492   609  1413  1703  2184  2213  1813  1945  2246  2389 
##  1963  1964  1965  1966  1967  1968  1969  1970  1971  1972  1973  1974  1975 
##  1968  2348  2855  2388  2688  3312  2926  3215  3471  2168  4463  5386  4975 
##  1976  1977  1978  1979  1980  1981  1982  1983  1984  1985  1986  1987  1988 
##  3768  3728  3657  4279  6146  4517  7132  8322  7335  7979  8726  7367  7257 
##  1989  1990  1991  1992  1993  1994  1995  1996  1997  1998  1999  2000  2001 
## 10410 10946 12522 13534 12607 20631 27970 32270 28680 38128 31289 34471 34962 
##  2002  2003  2004  2005  2006  2007  2008  2009  2010  2011 
## 36293 39752 39363 39184 44034 43289 55663 45817 48161 62174
noaa3 <- subset(noaa2, year>=1996)
range(noaa3$year)
## [1] "1996" "2011"
table(noaa3$year)
## 
##  1996  1997  1998  1999  2000  2001  2002  2003  2004  2005  2006  2007  2008 
## 32270 28680 38128 31289 34471 34962 36293 39752 39363 39184 44034 43289 55663 
##  2009  2010  2011 
## 45817 48161 62174

Step 5: Perform pre-processing on variables used in the calculation of property damage and crop damage

There are four variables used to calculate property damage and crop damage:
- PROPDMG: the magnitude of property damage in U.S. Dollars
- PROPDMGEXP: the exponent values for property damage (meaning 10^PROPDMGEXP)
- CROPDMG: the magnitude of crop damage in U.S. Dollars
- CROPDMGEXP: the exponent values for crop damage (meaning 10^CROPDMGEXP)

  • Total property damage in billions of U.S. Dollars is calculated as PROPDMG * 10^PROPDMGEXP / 1000000000
  • Total crop damage in billions of U.S. Dollars is calculated as CROPDMG * 10^CROPDMGEXP / 1000000000

Step 5a: Pre-process property damage variables

  • We need to convert these values “K” NA “M” “B” “0” to a numeric for the exponent for PROPDMGEXP.
describe(noaa3$PROPDMG)
## noaa3$PROPDMG 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##   653530        0     1381    0.641    11.69    21.92        0        0 
##      .25      .50      .75      .90      .95 
##        0        0        1       15       50 
## 
## lowest :    0.00    0.01    0.02    0.03    0.04
## highest: 3200.00 3500.00 4410.00 4800.00 5000.00
unique(noaa3$PROPDMGEXP)
## [1] "K" NA  "M" "B" "0"
sum(is.na(noaa3$PROPDMGEXP))
## [1] 276185
sum(!is.na(noaa3$PROPDMGEXP))
## [1] 377345
table(noaa3$PROPDMGEXP)
## 
##      0      B      K      M 
##      1     32 369938   7374
noaa3$PROPDMGEXP <- gsub("K", "3", noaa3$PROPDMGEXP) #Create an exponent value for thousand
noaa3$PROPDMGEXP <- gsub("M", "6", noaa3$PROPDMGEXP) #Create an exponent value for million
noaa3$PROPDMGEXP <- gsub("B", "9", noaa3$PROPDMGEXP) #Create an exponent value for billion
noaa3$PROPDMGEXP <- gsub("NA", "NA", noaa3$PROPDMGEXP) 
noaa3$PROPDMGEXP <- gsub("0", "0", noaa3$PROPDMGEXP) 
noaa3$PROPDMGEXP <- as.numeric(noaa3$PROPDMGEXP)
sum(is.na(noaa3$PROPDMGEXP))
## [1] 276185
sum(!is.na(noaa3$PROPDMGEXP))
## [1] 377345
unique(noaa3$PROPDMGEXP)
## [1]  3 NA  6  9  0
table(noaa3$PROPDMGEXP)
## 
##      0      3      6      9 
##      1 369938   7374     32
noaa3$propertydamage <- (noaa3$PROPDMG * (10 ^ noaa3$PROPDMGEXP))/1000000000

Step 5b: Pre-process crop damage variables

  • We need to convert these values “K” NA “M” “B” to a numeric for the exponent for CROPDMGEXP.
describe(noaa3$CROPDMG)
## noaa3$CROPDMG 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##   653530        0      415    0.083    1.839    3.652        0        0 
##      .25      .50      .75      .90      .95 
##        0        0        0        0        0 
## 
## lowest :   0.00   0.01   0.10   0.15   0.20, highest: 950.00 975.00 978.00 985.00 990.00
unique(noaa3$CROPDMGEXP)
## [1] "K" NA  "M" "B"
sum(is.na(noaa3$CROPDMGEXP))
## [1] 373069
sum(!is.na(noaa3$CROPDMGEXP))
## [1] 280461
table(noaa3$CROPDMGEXP)
## 
##      B      K      M 
##      4 278686   1771
noaa3$CROPDMGEXP <- gsub("K", "3", noaa3$CROPDMGEXP) #Create an exponent value for thousand
noaa3$CROPDMGEXP <- gsub("M", "6", noaa3$CROPDMGEXP) #Create an exponent value for million
noaa3$CROPDMGEXP <- gsub("B", "9", noaa3$CROPDMGEXP) #Create an exponent value for billion
noaa3$CROPDMGEXP <- gsub("NA", "NA", noaa3$CROPDMGEXP) 
noaa3$CROPDMGEXP <- as.numeric(noaa3$CROPDMGEXP)
sum(is.na(noaa3$CROPDMGEXP))
## [1] 373069
sum(!is.na(noaa3$CROPDMGEXP))
## [1] 280461
unique(noaa3$CROPDMGEXP)
## [1]  3 NA  6  9
table(noaa3$CROPDMGEXP)
## 
##      3      6      9 
## 278686   1771      4
noaa3$cropdamage <- (noaa3$CROPDMG * (10 ^ noaa3$CROPDMGEXP))/1000000000

Results

Step 6: Answer the question – Across the United States, which types of severe weather events are most harmful with respect to population health?

Step 6a: Explore fatalities due to severe weather events

  • Answer: The graph below shows the top 10 severe weather events associated with the greatest number of fatalities. The top three are excessive heat, tornado, and flash flood.
totalfatalities <- aggregate(FATALITIES ~ EVTYPE, noaa3, sum)
totalfatalities[which.max(totalfatalities$FATALITIES),]
##            EVTYPE FATALITIES
## 81 EXCESSIVE HEAT       1797
#Create a smaller dataset with the top 10 events by total number of fatalities
toptenFatalities <- totalfatalities[order(-totalfatalities$FATALITIES), ][1:10, ]
toptenFatalities
##             EVTYPE FATALITIES
## 81  EXCESSIVE HEAT       1797
## 426        TORNADO       1511
## 98     FLASH FLOOD        887
## 224      LIGHTNING        651
## 102          FLOOD        414
## 300    RIP CURRENT        340
## 434      TSTM WIND        241
## 147           HEAT        237
## 177      HIGH WIND        235
## 16       AVALANCHE        223
#Plot the total number of fatalities by event type for the top 10 events
ggplot(data=toptenFatalities, aes(x=reorder(EVTYPE, FATALITIES), y=FATALITIES, fill=EVTYPE)) + geom_bar(stat="identity") + xlab("Weather Event") + ylab("Total number of fatalities") + ggtitle("Top 10 Events with Highest Fatalities") + coord_flip()

Step 6b: Explore injuries due to severe weather events

  • Answer: The graph below shows the top 10 severe weather events associated with the greatest number of injuries. The top three are tornado, flood, and excessive heat.
totalinjuries <- aggregate(INJURIES ~ EVTYPE, noaa3, sum)
totalinjuries[which.max(totalinjuries$INJURIES),]
##      EVTYPE INJURIES
## 426 TORNADO    20667
#Create a smaller dataset with the top 10 events by total number of injuries
toptenInjuries <- totalinjuries[order(-totalinjuries$INJURIES), ][1:10, ]
toptenInjuries
##                EVTYPE INJURIES
## 426           TORNADO    20667
## 102             FLOOD     6758
## 81     EXCESSIVE HEAT     6391
## 224         LIGHTNING     4141
## 434         TSTM WIND     3629
## 98        FLASH FLOOD     1674
## 421 THUNDERSTORM WIND     1400
## 507      WINTER STORM     1292
## 185 HURRICANE/TYPHOON     1275
## 147              HEAT     1222
#Plot the total number of injuries by event type for the top 10 events
ggplot(data=toptenInjuries, aes(x=reorder(EVTYPE, INJURIES), y=INJURIES, fill=EVTYPE)) + geom_bar(stat="identity") + xlab("Weather Event") + ylab("Total number of injuries") + ggtitle("Top 10 Events with Highest Injuries") + coord_flip()

Step 7: Answer the question – Across the United States, which types of events have the greatest economic consequences?

Step 7a: Explore property damage (in billions of USD) due to severe weather events

  • Answer: The graph below shows the top 10 severe weather events associated with the largest amount of property damage. The top three are floods, hurricane/typhoon, and storm surge.
totalpropertydamage <- aggregate(propertydamage ~ EVTYPE, noaa3, sum)
totalpropertydamage[which.max(totalpropertydamage$propertydamage),]
##    EVTYPE propertydamage
## 42  FLOOD       143.9448
#Create a smaller dataset with the top 10 events by total number of fatalities
toptenpropdamag <- totalpropertydamage[order(-totalpropertydamage$propertydamage), ][1:10, ]
toptenpropdamag
##                EVTYPE propertydamage
## 42              FLOOD     143.944834
## 84  HURRICANE/TYPHOON      69.305840
## 134       STORM SURGE      43.193536
## 143           TORNADO      24.616946
## 40        FLASH FLOOD      15.222204
## 67               HAIL      14.595143
## 83          HURRICANE      11.812819
## 145    TROPICAL STORM       7.642476
## 80          HIGH WIND       5.247860
## 168          WILDFIRE       4.758667
#Plot the total amount of property damage by event type for the top 10 events
ggplot(data=toptenpropdamag, aes(x=reorder(EVTYPE, propertydamage), y=propertydamage, fill=EVTYPE)) + geom_bar(stat="identity") + xlab("Weather Event") + ylab("Total Property Damange in Billions of Dollars") + ggtitle("Events with Highest Property Damange (Billions of Dollars)") + coord_flip()

Step 7b: Explore crop damage (in billions of USD) due to severe weather events

  • Answer: The table below shows the top 10 severe weather events associated with the largest amount of crop damage. The top three are drought, flood, and hurricane.
totalcropdamage <- aggregate(cropdamage ~ EVTYPE, noaa3, sum)
totalcropdamage[which.max(totalcropdamage$cropdamage),]
##     EVTYPE cropdamage
## 11 DROUGHT   13.36757
#Create a smaller dataset with the top 10 events by total number of fatalities
toptencropdamag <- totalcropdamage[order(-totalcropdamage$cropdamage), ][1:10, ]
toptencropdamag
##               EVTYPE cropdamage
## 11           DROUGHT 13.3675660
## 22             FLOOD  4.9747784
## 41         HURRICANE  2.7414100
## 42 HURRICANE/TYPHOON  2.6078728
## 32              HAIL  2.4760294
## 21       FLASH FLOOD  1.3349017
## 18      EXTREME COLD  1.2889730
## 28      FROST/FREEZE  1.0940860
## 35        HEAVY RAIN  0.7281698
## 67    TROPICAL STORM  0.6777110