1. Synopsis

Our objective in this report is to explore the impact of severe weather events on both public health and the economy in the United States. To achieve this objective, we use the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database, which covers the years from 1950 to 2011. More information about the NOAA storm database is available online.

In the rest of this report we try to answer the following two important questions:

  1. Which types of weather events across the United States are most harmful with respect to population health?

  2. Which types of weather events across the United States have the greatest economic consequences?

2. Data Loading and Processing

We first download the raw data from the given URL and then load the dataset for processing and cleaning. The file is bzip2-compressed, and read.csv decompresses it while reading, so no separate unzip step is needed.

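# load the packages used below (dplyr for %>% and the data verbs, knitr for kable)
library(dplyr)
library(knitr)

# download the data ("wb" keeps the compressed file intact on Windows)
URL <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
download.file(URL, destfile = "repdata%2Fdata%2FStormData.csv.bz2", mode = "wb")

# load the data
storm_data <- read.csv("repdata%2Fdata%2FStormData.csv.bz2")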
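
Because the compressed file is large, it can be convenient to skip the download when the file is already on disk. The following is a minimal sketch of such a guard; the helper name `fetch_if_missing` and its `downloader` argument are hypothetical and are not part of the original analysis.

```r
# Hypothetical helper: download `url` to `dest` only if `dest` does not exist yet.
# `downloader` defaults to download.file; it is a parameter so the function can
# be exercised without network access.
fetch_if_missing <- function(url, dest, downloader = utils::download.file) {
  if (file.exists(dest)) {
    return(invisible(FALSE))                     # file already present, nothing downloaded
  }
  downloader(url, destfile = dest, mode = "wb")  # "wb" keeps the .bz2 file binary-safe
  invisible(TRUE)                                # a download was attempted
}
```

Under these assumptions, the download step would become `fetch_if_missing(URL, "repdata%2Fdata%2FStormData.csv.bz2")`, followed by the same read.csv call.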

As an initial task, we check details of the dataset such as its structure, dimensions, and variable names.

str(storm_data) 
## 'data.frame':    902297 obs. of  37 variables:
##  $ STATE__   : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ BGN_DATE  : chr  "4/18/1950 0:00:00" "4/18/1950 0:00:00" "2/20/1951 0:00:00" "6/8/1951 0:00:00" ...
##  $ BGN_TIME  : chr  "0130" "0145" "1600" "0900" ...
##  $ TIME_ZONE : chr  "CST" "CST" "CST" "CST" ...
##  $ COUNTY    : num  97 3 57 89 43 77 9 123 125 57 ...
##  $ COUNTYNAME: chr  "MOBILE" "BALDWIN" "FAYETTE" "MADISON" ...
##  $ STATE     : chr  "AL" "AL" "AL" "AL" ...
##  $ EVTYPE    : chr  "TORNADO" "TORNADO" "TORNADO" "TORNADO" ...
##  $ BGN_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ BGN_AZI   : chr  "" "" "" "" ...
##  $ BGN_LOCATI: chr  "" "" "" "" ...
##  $ END_DATE  : chr  "" "" "" "" ...
##  $ END_TIME  : chr  "" "" "" "" ...
##  $ COUNTY_END: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ COUNTYENDN: logi  NA NA NA NA NA NA ...
##  $ END_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ END_AZI   : chr  "" "" "" "" ...
##  $ END_LOCATI: chr  "" "" "" "" ...
##  $ LENGTH    : num  14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
##  $ WIDTH     : num  100 150 123 100 150 177 33 33 100 100 ...
##  $ F         : int  3 2 2 2 2 2 2 1 3 3 ...
##  $ MAG       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ FATALITIES: num  0 0 0 0 0 0 0 0 1 0 ...
##  $ INJURIES  : num  15 0 2 2 2 6 1 0 14 0 ...
##  $ PROPDMG   : num  25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
##  $ PROPDMGEXP: chr  "K" "K" "K" "K" ...
##  $ CROPDMG   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CROPDMGEXP: chr  "" "" "" "" ...
##  $ WFO       : chr  "" "" "" "" ...
##  $ STATEOFFIC: chr  "" "" "" "" ...
##  $ ZONENAMES : chr  "" "" "" "" ...
##  $ LATITUDE  : num  3040 3042 3340 3458 3412 ...
##  $ LONGITUDE : num  8812 8755 8742 8626 8642 ...
##  $ LATITUDE_E: num  3051 0 0 0 0 ...
##  $ LONGITUDE_: num  8806 0 0 0 0 ...
##  $ REMARKS   : chr  "" "" "" "" ...
##  $ REFNUM    : num  1 2 3 4 5 6 7 8 9 10 ...
names(storm_data) 
##  [1] "STATE__"    "BGN_DATE"   "BGN_TIME"   "TIME_ZONE"  "COUNTY"    
##  [6] "COUNTYNAME" "STATE"      "EVTYPE"     "BGN_RANGE"  "BGN_AZI"   
## [11] "BGN_LOCATI" "END_DATE"   "END_TIME"   "COUNTY_END" "COUNTYENDN"
## [16] "END_RANGE"  "END_AZI"    "END_LOCATI" "LENGTH"     "WIDTH"     
## [21] "F"          "MAG"        "FATALITIES" "INJURIES"   "PROPDMG"   
## [26] "PROPDMGEXP" "CROPDMG"    "CROPDMGEXP" "WFO"        "STATEOFFIC"
## [31] "ZONENAMES"  "LATITUDE"   "LONGITUDE"  "LATITUDE_E" "LONGITUDE_"
## [36] "REMARKS"    "REFNUM"

Since the variable names are all upper case, we convert them to lower case.

names(storm_data) <- tolower(names(storm_data)) 

We now look at the important variables. The first is the storm type (evtype); we inspect a random sample of 100 of its distinct values:

sample(unique(storm_data$evtype), 100)
##   [1] "HEAVY RAIN/URBAN FLOOD"         "Urban flood"                   
##   [3] "HIGH WINDS AND WIND CHILL"      "FROST/FREEZE"                  
##   [5] "HEAVY SURF/HIGH SURF"           "URBAN/SML STREAM FLDG"         
##   [7] "HURRICANE OPAL"                 "SEVERE THUNDERSTORM WINDS"     
##   [9] "THUNDERSTORM WIND 98 MPH"       "THUNDERSTORM WINDSS"           
##  [11] "HIGH WINDS 67"                  "THUNDERSTORM  WINDS"           
##  [13] "EXTREME/RECORD COLD"            "THUNDERSTORM WINDS 60"         
##  [15] "ICE AND SNOW"                   "Strong Wind"                   
##  [17] "URBAN FLOOD"                    "RECORD HIGH TEMPERATURES"      
##  [19] "Marine Accident"                "FLOOD/RAIN/WIND"               
##  [21] "SMALL HAIL"                     "COASTAL/TIDAL FLOOD"           
##  [23] "HURRICANE FELIX"                "THUNDERSTORM WINDS/HEAVY RAIN" 
##  [25] "LOW WIND CHILL"                 "FLASH FLOOD"                   
##  [27] "ASTRONOMICAL LOW TIDE"          "ICE PELLETS"                   
##  [29] "WATERSPOUT TORNADO"             "THUNDERSTORM WINDS G60"        
##  [31] "BLIZZARD WEATHER"               "SNOW SQUALL"                   
##  [33] "HEAVY SNOW & ICE"               "BLIZZARD/HEAVY SNOW"           
##  [35] "FLOOD & HEAVY RAIN"             "FLASH FLOOD/LANDSLIDE"         
##  [37] "WINTER STORMS"                  "RECORD HEAT WAVE"              
##  [39] "ABNORMALLY DRY"                 "WET SNOW"                      
##  [41] "WIND ADVISORY"                  "Summary of June 4"             
##  [43] "WIND AND WAVE"                  "HURRICANE EMILY"               
##  [45] "HIGH  SWELLS"                   "WARM DRY CONDITIONS"           
##  [47] "FLOOD WATCH/"                   "TORNDAO"                       
##  [49] "MAJOR FLOOD"                    "EXCESSIVE RAIN"                
##  [51] "LATE SEASON HAIL"               "LIGHTNING AND WINDS"           
##  [53] "Freezing drizzle"               "RECORD RAINFALL"               
##  [55] "WINTER STORM"                   "Summary of June 6"             
##  [57] "Summary of May 31 pm"           "COLD AND FROST"                
##  [59] "URBAN/SMALL STREAM"             "UNSEASONABLY WARM/WET"         
##  [61] "Summary of June 10"             "Microburst"                    
##  [63] "HEAT DROUGHT"                   "Gusty Wind"                    
##  [65] "URBAN/STREET FLOODING"          " WATERSPOUT"                   
##  [67] "APACHE COUNTY"                  "SMALL STREAM URBAN FLOOD"      
##  [69] "THUNDERESTORM WINDS"            "WILD FIRES"                    
##  [71] "FREEZING RAIN/SNOW"             "VOLCANIC ASH"                  
##  [73] "Summary August 4"               "EXCESSIVELY DRY"               
##  [75] "LAKE FLOOD"                     "STRONG WIND"                   
##  [77] "HEAVY SNOW/HIGH WIND"           "SMALL STREAM AND URBAN FLOODIN"
##  [79] "Metro Storm, May 26"            "DROUGHT/EXCESSIVE HEAT"        
##  [81] "BITTER WIND CHILL"              "LOCALLY HEAVY RAIN"            
##  [83] "BEACH EROSION/COASTAL FLOOD"    "HIGH WINDS"                    
##  [85] "THUNDERSTORM WINDS/FLASH FLOOD" "DUST DEVIL WATERSPOUT"         
##  [87] "HIGH WINDS 66"                  "Funnel Cloud"                  
##  [89] "WATERSPOUT/TORNADO"             "HVY RAIN"                      
##  [91] "RAIN/WIND"                      "STRONG WIND GUST"              
##  [93] "FREEZING RAIN AND SNOW"         "Freeze"                        
##  [95] "TSTM WIND 52"                   "River Flooding"                
##  [97] "RECORD TEMPERATURES"            "MICROBURST WINDS"              
##  [99] "Urban Flooding"                 "LAKESHORE FLOOD"

We can observe a major problem with the type names: some types appear in both lower and upper case, some appear with different numbers at the end, and so on. We can follow simple steps to fix these issues. First, we convert all type names to lower case, then sort the types by frequency and show those that occur more than 5000 times.

storm_data$evtype <- tolower(storm_data$evtype)

storm_data %>% group_by(evtype) %>% count() %>%
  arrange(desc(n)) %>% filter(n>5000) %>% kable()

|evtype                   |      n|
|:------------------------|------:|
|hail                     | 288661|
|tstm wind                | 219942|
|thunderstorm wind        |  82564|
|tornado                  |  60652|
|flash flood              |  54277|
|flood                    |  25327|
|thunderstorm winds       |  20843|
|high wind                |  20214|
|lightning                |  15754|
|heavy snow               |  15708|
|heavy rain               |  11742|
|winter storm             |  11433|
|winter weather           |   7045|
|funnel cloud             |   6844|
|marine tstm wind         |   6175|
|marine thunderstorm wind |   5812|

This table shows 16 storm types with frequencies greater than 5000; however, two of them appear under several names: (1) thunderstorm wind appears as tstm wind, thunderstorm wind, and thunderstorm winds, and (2) marine thunderstorm wind appears as marine thunderstorm wind and marine tstm wind. We fix this issue in the following code:

storm_data <- storm_data %>%
  mutate(evtype = recode(evtype, 'tstm wind' = "thunderstorm wind",
                         'thunderstorm winds' = "thunderstorm wind",
                         'marine tstm wind' = "marine thunderstorm wind",
                         .default = NULL))
  
storm_data %>% group_by(evtype) %>% count() %>%
  arrange(desc(n)) %>% filter(n>5000) %>% kable()

|evtype                   |      n|
|:------------------------|------:|
|thunderstorm wind        | 323349|
|hail                     | 288661|
|tornado                  |  60652|
|flash flood              |  54277|
|flood                    |  25327|
|high wind                |  20214|
|lightning                |  15754|
|heavy snow               |  15708|
|marine thunderstorm wind |  11987|
|heavy rain               |  11742|
|winter storm             |  11433|
|winter weather           |   7045|
|funnel cloud             |   6844|
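
The recoding above handles only the most frequent variants. The other inconsistencies noted earlier (stray whitespace, trailing numbers, plural forms) could be addressed with a small normalisation function. The sketch below is one possible approach in base R; the helper `normalize_evtype` is hypothetical, it is more aggressive than the recoding used in this report, and it was not used to produce any of the results shown here.

```r
# Hypothetical helper: normalise raw event-type labels (not used for the report's results).
normalize_evtype <- function(x) {
  x <- tolower(x)
  x <- gsub("[[:space:]]+", " ", x)       # collapse runs of whitespace
  x <- trimws(x)                           # drop leading/trailing blanks
  x <- sub(" (g)?[0-9]+( mph)?$", "", x)   # strip a trailing speed such as " 60", " g60", " 98 mph"
  x <- sub("^tstm ", "thunderstorm ", x)   # expand the "tstm" abbreviation
  x <- sub("^marine tstm ", "marine thunderstorm ", x)
  x <- sub("winds+$", "wind", x)           # "winds", "windss" -> "wind"
  x
}

raw <- c("THUNDERSTORM  WINDS", "TSTM WIND 52", "THUNDERSTORM WINDS G60",
         " WATERSPOUT", "HIGH WINDS 66")
normalize_evtype(raw)
```

Note that stripping trailing numbers also alters labels such as "Summary of June 4", so a rule like this would need checking against the full list of types before being applied to the real data.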

Based on our objective and the two research questions we want to answer, we organize our analysis variables into three groups. Information about these variables is available online from the National Climatic Data Center Storm Events FAQ and the National Weather Service storm data documentation. The three groups of variables are as follows:

  1. Variables related to weather events: the storm type (evtype).

  2. Variables related to the impact of weather events on public health: fatalities and injuries.

  3. Variables related to the impact of weather events on the economy: property damage (propdmg and propdmgexp) and crop damage (cropdmg and cropdmgexp). In each of these two pairs, the first variable gives the first three significant digits of the damage amount and the second gives the exponent (for example, K for thousands).

Accordingly, we clean the data by keeping only these variables and removing the rest.

# select relevant variables 
data_cleaned <- storm_data %>% select(evtype, fatalities, 
               injuries, propdmg, propdmgexp, cropdmg, cropdmgexp)

str(data_cleaned)
## 'data.frame':    902297 obs. of  7 variables:
##  $ evtype    : chr  "tornado" "tornado" "tornado" "tornado" ...
##  $ fatalities: num  0 0 0 0 0 0 0 0 1 0 ...
##  $ injuries  : num  15 0 2 2 2 6 1 0 14 0 ...
##  $ propdmg   : num  25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
##  $ propdmgexp: chr  "K" "K" "K" "K" ...
##  $ cropdmg   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ cropdmgexp: chr  "" "" "" "" ...

Next, we convert the exponent variables (propdmgexp and cropdmgexp) to numeric multipliers and use them to calculate the property and crop costs (propcost and cropcost). Any code not listed in the mapping, including an empty string, is treated as a multiplier of 1:

data_cleaned$propdmgexp <- tolower(data_cleaned$propdmgexp) 
data_cleaned$cropdmgexp <- tolower(data_cleaned$cropdmgexp) 

data_cleaned <- data_cleaned %>%
  mutate(propdmgexp = recode(propdmgexp,
                             "-" = 10^0,
                             "+" = 10^0,
                             "0" = 10^0,
                             "1" = 10^1,
                             "2" = 10^2,
                             "3" = 10^3,
                             "4" = 10^4,
                             "5" = 10^5,
                             "6" = 10^6,
                             "7" = 10^7,
                             "8" = 10^8,
                             "9" = 10^9,
                             "h" = 10^2,
                             "k" = 10^3,
                             "m" = 10^6,
                             "b" = 10^9,
                             .default = 10^0),
         cropdmgexp = recode(cropdmgexp,
                             "?" = 10^0, 
                             "0" = 10^0,
                             "k" = 10^3,
                             "m" = 10^6,
                             "b" = 10^9,
                             .default = 10^0),
         propcost = propdmg*propdmgexp,
         cropcost = cropdmg*cropdmgexp) %>% 
  select(evtype, fatalities, 
               injuries, propcost, cropcost) 
str(data_cleaned)
## 'data.frame':    902297 obs. of  5 variables:
##  $ evtype    : chr  "tornado" "tornado" "tornado" "tornado" ...
##  $ fatalities: num  0 0 0 0 0 0 0 0 1 0 ...
##  $ injuries  : num  15 0 2 2 2 6 1 0 14 0 ...
##  $ propcost  : num  25000 2500 25000 2500 2500 2500 2500 2500 25000 25000 ...
##  $ cropcost  : num  0 0 0 0 0 0 0 0 0 0 ...
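
As a cross-check on the conversion above, the same mapping can be sketched in base R with a named lookup vector. The names `exp_lookup` and `exp_to_multiplier` are hypothetical. The lookup mirrors the propdmgexp mapping; the cropdmgexp call above lists fewer codes, so the two are not identical for crop damage.

```r
# Multipliers keyed by lower-cased exponent code, mirroring the propdmgexp recode() call.
exp_lookup <- c("h" = 1e2, "k" = 1e3, "m" = 1e6, "b" = 1e9,
                setNames(10^(0:9), as.character(0:9)))

# Hypothetical helper: map exponent codes to multipliers. Anything not in the
# lookup (e.g. "", "-", "+", "?") falls back to 1, as with .default = 10^0.
exp_to_multiplier <- function(code) {
  mult <- unname(exp_lookup[tolower(code)])
  mult[is.na(mult)] <- 1
  mult
}

exp_to_multiplier(c("K", "M", "", "5", "?"))
25 * exp_to_multiplier("K")   # 25000, matching the first propcost value above
```

A lookup vector keeps the code-to-multiplier table in one place, which makes it easier to review the assumptions about ambiguous codes.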

3. Main Results

We compute the total fatalities, injuries, property damage, and crop damage per severe weather event type, and present the top 10 event types for each measure in tables.

Totals <- data_cleaned %>% group_by(evtype) %>%
  summarize(total_fatalities  = sum(fatalities, na.rm = TRUE),
            total_injuries  = sum(injuries, na.rm = TRUE),
            total_propdmg  = sum(propcost, na.rm = TRUE),
            total_cropdmg  = sum(cropcost, na.rm = TRUE)) %>%
  ungroup() 
## `summarise()` ungrouping output (override with `.groups` argument)
total_fatalities <- Totals %>% 
  select(evtype, total = total_fatalities) %>%
  arrange(desc(total)) %>% head(10) 

total_fatalities %>% kable()

|evtype            | total|
|:-----------------|-----:|
|tornado           |  5633|
|excessive heat    |  1903|
|flash flood       |   978|
|heat              |   937|
|lightning         |   816|
|thunderstorm wind |   701|
|flood             |   470|
|rip current       |   368|
|high wind         |   248|
|avalanche         |   224|

total_injuries <- Totals %>%
  select(evtype, total = total_injuries) %>%
  arrange(desc(total)) %>% head(10) 

total_injuries %>% kable()

|evtype            | total|
|:-----------------|-----:|
|tornado           | 91346|
|thunderstorm wind |  9353|
|flood             |  6789|
|excessive heat    |  6525|
|lightning         |  5230|
|heat              |  2100|
|ice storm         |  1975|
|flash flood       |  1777|
|hail              |  1361|
|winter storm      |  1321|

total_propdmg <- Totals %>%
  select(evtype, total = total_propdmg) %>%
  arrange(desc(total)) %>% head(10) 

total_propdmg %>% kable()

|evtype            |        total|
|:-----------------|------------:|
|flood             | 144657709807|
|hurricane/typhoon |  69305840000|
|tornado           |  56947380677|
|storm surge       |  43323536000|
|flash flood       |  16822673979|
|hail              |  15735267513|
|hurricane         |  11868319010|
|thunderstorm wind |   9912671826|
|tropical storm    |   7703890550|
|winter storm      |   6688497251|

total_cropdmg <- Totals %>%
  select(evtype, total = total_cropdmg) %>%
  arrange(desc(total)) %>% head(10) 

total_cropdmg %>% kable()

|evtype            |       total|
|:-----------------|-----------:|
|drought           | 13972566000|
|flood             |  5661968450|
|river flood       |  5029459000|
|ice storm         |  5022113500|
|hail              |  3025954473|
|hurricane         |  2741910000|
|hurricane/typhoon |  2607872800|
|flash flood       |  1421317100|
|extreme cold      |  1312973000|
|thunderstorm wind |  1159505188|
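
The four top-10 tables above repeat the same select, arrange, and head pattern. The sketch below shows how that pattern could be wrapped in one function. It uses base R and a small made-up data frame so that it runs on its own; the helper `top_events` and the `toy` data are hypothetical and stand in for data_cleaned and Totals.

```r
# Toy stand-in for data_cleaned (values are made up for illustration).
toy <- data.frame(
  evtype     = c("tornado", "flood", "tornado", "hail", "flood"),
  fatalities = c(3, 1, 2, 0, 1),
  injuries   = c(10, 0, 5, 1, 2),
  stringsAsFactors = FALSE
)

# Sum each measure per event type (base R analogue of group_by + summarize).
totals <- aggregate(cbind(fatalities, injuries) ~ evtype, data = toy, FUN = sum)

# Hypothetical helper: the n event types with the largest value in `column`.
top_events <- function(totals, column, n = 10) {
  ranked <- totals[order(totals[[column]], decreasing = TRUE), c("evtype", column)]
  names(ranked) <- c("evtype", "total")
  rownames(ranked) <- NULL
  head(ranked, n)
}

top_events(totals, "fatalities", n = 2)
top_events(totals, "injuries", n = 2)
```

With a helper like this, each table would reduce to a single call with a different column name, which lowers the risk of the four code chunks drifting apart.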

Now we plot the total fatalities and injuries side by side to compare them and evaluate the impact on public health.

par(mfrow = c(1, 2), mar = c(10, 5, 3, 2))
barplot(total_fatalities$total/10^3, las = 3, names.arg = total_fatalities$evtype, col = "black", main = "Top 10 causes for Fatalities", cex.names = 0.8, ylab = "Total Fatalities (in thousands)")
barplot(total_injuries$total/10^3, las = 3, names.arg = total_injuries$evtype, col = "red", main = "Top 10 causes for Injuries", cex.names = 0.8, ylab = "Total Injuries (in thousands)")

Similarly, we plot the total property and crop damage side by side to compare them and evaluate the impact on the economy.

par(mfrow = c(1, 2), mar = c(10, 5, 3, 1))
barplot(total_propdmg$total/10^9, las = 3, names.arg = total_propdmg$evtype, col = "gray", main = "Top 10 causes for properties damage", cex.names = 0.8, ylab = "Properties damage (in billion dollars)")
barplot(total_cropdmg$total/10^9, las = 3, names.arg = total_cropdmg$evtype, col = "green", main = "Top 10 causes for crops damage", cex.names = 0.8, ylab = "Crops damage (in billion dollars)")

4. Conclusion

From our analysis, we conclude that tornadoes, thunderstorm winds, and excessive heat are the most damaging events for public health, as reflected by the number of fatalities and injuries. On the other hand, floods, hurricanes/typhoons, and droughts have the greatest economic consequences, as represented by the total damage to property and crops.