Synopsis

Weather events can cause public health and economic problems. Severe events result in fatalities, injuries, and property damage.

This report explores the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. The database tracks characteristics of major storms and weather events in the United States, including when and where they occur, the type of event, and estimates of the resulting fatalities, injuries, and various forms of damage.

This analysis examines the damaging effects of severe weather events (e.g. thunderstorms and floods) on the population and the economy of the U.S. from 1950 to 2011. It highlights the event types associated with the greatest impact on population health and on the economy.

Data Processing

Loading necessary R packages:

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

Loading the dataset into the R environment:

The data was downloaded from the URL provided in the course project page. The file was unzipped and extracted to the working directory manually.

data = read.csv("repdata_data_StormData.csv")
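The download step above was done by hand. As a minimal sketch of how it could be scripted, the helper below downloads a file only when it is not already present. The name fetch_if_missing(), the placeholder URL, and the file name in the usage comment are assumptions for illustration; the real URL is the one on the course project page and is not reproduced here.

```r
# Hypothetical helper: download `url` to `dest` only if `dest` does not exist yet.
# Returns the local path so it can be passed straight to read.csv().
fetch_if_missing <- function(url, dest) {
  if (!file.exists(dest)) {
    download.file(url, destfile = dest, mode = "wb")
  }
  dest
}

# Usage (placeholder URL and file name, both assumptions):
# path <- fetch_if_missing("<URL from the course project page>", "StormData.csv.bz2")
# data <- read.csv(path)  # read.csv() can read a bzip2-compressed file directly
```

Skipping the download when the file already exists keeps repeated runs of the report fast and avoids depending on the network each time.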

Getting an overview of the data

names(data)
##  [1] "STATE__"    "BGN_DATE"   "BGN_TIME"   "TIME_ZONE"  "COUNTY"    
##  [6] "COUNTYNAME" "STATE"      "EVTYPE"     "BGN_RANGE"  "BGN_AZI"   
## [11] "BGN_LOCATI" "END_DATE"   "END_TIME"   "COUNTY_END" "COUNTYENDN"
## [16] "END_RANGE"  "END_AZI"    "END_LOCATI" "LENGTH"     "WIDTH"     
## [21] "F"          "MAG"        "FATALITIES" "INJURIES"   "PROPDMG"   
## [26] "PROPDMGEXP" "CROPDMG"    "CROPDMGEXP" "WFO"        "STATEOFFIC"
## [31] "ZONENAMES"  "LATITUDE"   "LONGITUDE"  "LATITUDE_E" "LONGITUDE_"
## [36] "REMARKS"    "REFNUM"
str(data)
## 'data.frame':    902297 obs. of  37 variables:
##  $ STATE__   : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ BGN_DATE  : chr  "4/18/1950 0:00:00" "4/18/1950 0:00:00" "2/20/1951 0:00:00" "6/8/1951 0:00:00" ...
##  $ BGN_TIME  : chr  "0130" "0145" "1600" "0900" ...
##  $ TIME_ZONE : chr  "CST" "CST" "CST" "CST" ...
##  $ COUNTY    : num  97 3 57 89 43 77 9 123 125 57 ...
##  $ COUNTYNAME: chr  "MOBILE" "BALDWIN" "FAYETTE" "MADISON" ...
##  $ STATE     : chr  "AL" "AL" "AL" "AL" ...
##  $ EVTYPE    : chr  "TORNADO" "TORNADO" "TORNADO" "TORNADO" ...
##  $ BGN_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ BGN_AZI   : chr  "" "" "" "" ...
##  $ BGN_LOCATI: chr  "" "" "" "" ...
##  $ END_DATE  : chr  "" "" "" "" ...
##  $ END_TIME  : chr  "" "" "" "" ...
##  $ COUNTY_END: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ COUNTYENDN: logi  NA NA NA NA NA NA ...
##  $ END_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ END_AZI   : chr  "" "" "" "" ...
##  $ END_LOCATI: chr  "" "" "" "" ...
##  $ LENGTH    : num  14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
##  $ WIDTH     : num  100 150 123 100 150 177 33 33 100 100 ...
##  $ F         : int  3 2 2 2 2 2 2 1 3 3 ...
##  $ MAG       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ FATALITIES: num  0 0 0 0 0 0 0 0 1 0 ...
##  $ INJURIES  : num  15 0 2 2 2 6 1 0 14 0 ...
##  $ PROPDMG   : num  25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
##  $ PROPDMGEXP: chr  "K" "K" "K" "K" ...
##  $ CROPDMG   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CROPDMGEXP: chr  "" "" "" "" ...
##  $ WFO       : chr  "" "" "" "" ...
##  $ STATEOFFIC: chr  "" "" "" "" ...
##  $ ZONENAMES : chr  "" "" "" "" ...
##  $ LATITUDE  : num  3040 3042 3340 3458 3412 ...
##  $ LONGITUDE : num  8812 8755 8742 8626 8642 ...
##  $ LATITUDE_E: num  3051 0 0 0 0 ...
##  $ LONGITUDE_: num  8806 0 0 0 0 ...
##  $ REMARKS   : chr  "" "" "" "" ...
##  $ REFNUM    : num  1 2 3 4 5 6 7 8 9 10 ...

Keeping and processing only the useful data:

The dataset contains many variables (columns) that are not needed for this analysis, so only the relevant columns are kept.

imp = c("EVTYPE", "FATALITIES", "INJURIES", "PROPDMG", 
        "PROPDMGEXP", "CROPDMG", "CROPDMGEXP")
rel_data = data[imp]

Processing property damage data:

We list the property damage exponents (PROPDMGEXP) and assign a numeric multiplier to each level. Invalid exponents (‘+’, ‘-’ and ‘?’) are assigned a multiplier of 0, which excludes those records from the damage totals. The property damage value is then calculated by multiplying the property damage (PROPDMG) by its multiplier.

unique(rel_data$PROPDMGEXP)
##  [1] "K" "M" ""  "B" "m" "+" "0" "5" "6" "?" "4" "2" "3" "h" "7" "H" "-" "1" "8"
# Assigning multiplier values for the property damage exponents
rel_data$PROPEXP[rel_data$PROPDMGEXP == "K"] = 1000
rel_data$PROPEXP[rel_data$PROPDMGEXP == "M"] = 1e+06
rel_data$PROPEXP[rel_data$PROPDMGEXP == ""] = 1
rel_data$PROPEXP[rel_data$PROPDMGEXP == "B"] = 1e+09
rel_data$PROPEXP[rel_data$PROPDMGEXP == "m"] = 1e+06
rel_data$PROPEXP[rel_data$PROPDMGEXP == "0"] = 1
rel_data$PROPEXP[rel_data$PROPDMGEXP == "5"] = 1e+05
rel_data$PROPEXP[rel_data$PROPDMGEXP == "6"] = 1e+06
rel_data$PROPEXP[rel_data$PROPDMGEXP == "4"] = 10000
rel_data$PROPEXP[rel_data$PROPDMGEXP == "2"] = 100
rel_data$PROPEXP[rel_data$PROPDMGEXP == "3"] = 1000
rel_data$PROPEXP[rel_data$PROPDMGEXP == "h"] = 100
rel_data$PROPEXP[rel_data$PROPDMGEXP == "7"] = 1e+07
rel_data$PROPEXP[rel_data$PROPDMGEXP == "H"] = 100
rel_data$PROPEXP[rel_data$PROPDMGEXP == "1"] = 10
rel_data$PROPEXP[rel_data$PROPDMGEXP == "8"] = 1e+08

# Assigning 0 to invalid exponents
rel_data$PROPEXP[rel_data$PROPDMGEXP == "+"] = 0
rel_data$PROPEXP[rel_data$PROPDMGEXP == "-"] = 0
rel_data$PROPEXP[rel_data$PROPDMGEXP == "?"] = 0

# Calculating the property damage value
rel_data$PROPDMGVAL = rel_data$PROPDMG * rel_data$PROPEXP

Processing crop damage data:

We will use the same process for crop damage data.

unique(rel_data$CROPDMGEXP)
## [1] ""  "M" "K" "m" "B" "?" "0" "k" "2"
# Assigning multiplier values for the crop damage exponents
rel_data$CROPEXP[rel_data$CROPDMGEXP == "M"] = 1e+06
rel_data$CROPEXP[rel_data$CROPDMGEXP == "K"] = 1000
rel_data$CROPEXP[rel_data$CROPDMGEXP == "m"] = 1e+06
rel_data$CROPEXP[rel_data$CROPDMGEXP == "B"] = 1e+09
rel_data$CROPEXP[rel_data$CROPDMGEXP == "0"] = 1
rel_data$CROPEXP[rel_data$CROPDMGEXP == "k"] = 1000
rel_data$CROPEXP[rel_data$CROPDMGEXP == "2"] = 100
rel_data$CROPEXP[rel_data$CROPDMGEXP == ""] = 1

# Assigning 0 to invalid exponents
rel_data$CROPEXP[rel_data$CROPDMGEXP == "?"] = 0

# Calculating the crop damage value
rel_data$CROPDMGVAL = rel_data$CROPDMG * rel_data$CROPEXP
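The one-assignment-per-code approach above can also be expressed as a lookup table. The sketch below is an alternative, not the method used in this report: it reproduces the same mapping (upper- and lower-case letters treated alike, an empty exponent as 1, invalid codes as 0) on a small made-up data frame. The names exp_lookup, exp_to_multiplier() and toy are assumptions for illustration.

```r
# Multipliers keyed by exponent code; same values as the assignments above.
# Invalid codes ("+", "-", "?") map to 0.
exp_lookup <- c(
  "H" = 1e2, "K" = 1e3, "M" = 1e6, "B" = 1e9,
  "0" = 1, "1" = 1e1, "2" = 1e2, "3" = 1e3, "4" = 1e4,
  "5" = 1e5, "6" = 1e6, "7" = 1e7, "8" = 1e8,
  "+" = 0, "-" = 0, "?" = 0
)

# Hypothetical helper: convert a vector of exponent codes to multipliers.
# Codes missing from the table come back as NA so they are easy to spot.
exp_to_multiplier <- function(codes) {
  codes <- toupper(codes)                     # "k", "m", "h" -> "K", "M", "H"
  mult <- unname(exp_lookup[codes])
  mult[!is.na(codes) & codes == ""] <- 1      # empty exponent: value taken as-is
  mult
}

# Made-up rows, only to show the calculation
toy <- data.frame(
  PROPDMG = c(25, 2.5, 1, 3),
  PROPDMGEXP = c("K", "m", "", "?"),
  stringsAsFactors = FALSE
)
toy$PROPDMGVAL <- toy$PROPDMG * exp_to_multiplier(toy$PROPDMGEXP)
toy
```

Because the same function serves both PROPDMGEXP and CROPDMGEXP, the mapping only has to be maintained in one place.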

Across the United States, which types of events are the most harmful with respect to population health?

We first find the 10 event types that caused the most fatalities. The aggregate() function sums the fatalities for each event type, and the arrange() function from dplyr, chained with the %>% operator, sorts the rows in descending order of fatalities.

fatalities = aggregate(FATALITIES ~ EVTYPE, data = rel_data, sum)
ordered_fatalities = fatalities %>% arrange(desc(FATALITIES))
t10f = ordered_fatalities[1:10,]
t10f
##            EVTYPE FATALITIES
## 1         TORNADO       5633
## 2  EXCESSIVE HEAT       1903
## 3     FLASH FLOOD        978
## 4            HEAT        937
## 5       LIGHTNING        816
## 6       TSTM WIND        504
## 7           FLOOD        470
## 8     RIP CURRENT        368
## 9       HIGH WIND        248
## 10      AVALANCHE        224

Next, we do the same to find the event types that caused the highest number of injuries.

injuries = aggregate(INJURIES ~ EVTYPE, data = rel_data, sum)
ordered_injuries = injuries %>% arrange(desc(INJURIES))
t10i = ordered_injuries[1:10,]
t10i
##               EVTYPE INJURIES
## 1            TORNADO    91346
## 2          TSTM WIND     6957
## 3              FLOOD     6789
## 4     EXCESSIVE HEAT     6525
## 5          LIGHTNING     5230
## 6               HEAT     2100
## 7          ICE STORM     1975
## 8        FLASH FLOOD     1777
## 9  THUNDERSTORM WIND     1488
## 10              HAIL     1361
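The aggregate, sort and take-the-top-rows pattern used for fatalities and injuries can be wrapped in one function. The sketch below uses base R only and a small made-up data frame; the helper name top_events() and the toy values are assumptions for illustration, not part of the original analysis.

```r
# Hypothetical helper: total `value_col` per event type and return the n largest.
top_events <- function(df, value_col, n = 10) {
  totals <- aggregate(df[[value_col]], by = list(EVTYPE = df$EVTYPE), FUN = sum)
  names(totals)[2] <- value_col                     # aggregate() names the sum "x"
  totals <- totals[order(-totals[[value_col]]), ]   # sort in descending order
  rownames(totals) <- NULL
  head(totals, n)
}

# Made-up rows, only to show the call
toy <- data.frame(
  EVTYPE = c("TORNADO", "FLOOD", "TORNADO", "HEAT", "FLOOD"),
  FATALITIES = c(3, 1, 2, 4, 0),
  stringsAsFactors = FALSE
)
top_events(toy, "FATALITIES", n = 2)
```

With the real data, the same call could take rel_data and either "FATALITIES" or "INJURIES" as the column name.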

Now, we can plot our findings:

par(mfrow = c(1,2), mar = c(11, 5, 3, 2), mgp = c(3,1,0), cex = 0.8)
barplot(t10f$FATALITIES, names.arg = t10f$EVTYPE, col = "red",
        las = 3,
        ylab = "Fatalities", 
        main = "Top 10 Fatalities")
barplot(t10i$INJURIES, names.arg = t10i$EVTYPE, col = "yellow", 
        las = 3,
        ylab = "Injuries", 
        main = "Top 10 Injuries")

Figure 1: Top 10 event types by number of fatalities and by number of injuries.

Across the United States, which types of events have the greatest economic consequences?

Property damage and crop damage are the two measures of economic impact in the dataset. The PROPDMG and CROPDMG columns hold the damage figures, and the PROPDMGEXP and CROPDMGEXP columns hold the corresponding magnitude exponents. The two were combined into dollar values (PROPDMGVAL and CROPDMGVAL) in the Data Processing section.

We repeat the steps used for fatalities and injuries.

Finding events causing highest property damage:

prop = aggregate(PROPDMGVAL ~ EVTYPE, data = rel_data, sum)
ordered_prop = prop %>% arrange(desc(PROPDMGVAL))
t10p = ordered_prop[1:10,]
t10p
##               EVTYPE   PROPDMGVAL
## 1              FLOOD 144657709807
## 2  HURRICANE/TYPHOON  69305840000
## 3            TORNADO  56947380617
## 4        STORM SURGE  43323536000
## 5        FLASH FLOOD  16822673979
## 6               HAIL  15735267513
## 7          HURRICANE  11868319010
## 8     TROPICAL STORM   7703890550
## 9       WINTER STORM   6688497251
## 10         HIGH WIND   5270046260

Finding events causing highest crop damage:

crop = aggregate(CROPDMGVAL ~ EVTYPE, data = rel_data, sum)
ordered_crop = crop %>% arrange(desc(CROPDMGVAL))
t10c = ordered_crop[1:10,]
t10c
##               EVTYPE  CROPDMGVAL
## 1            DROUGHT 13972566000
## 2              FLOOD  5661968450
## 3        RIVER FLOOD  5029459000
## 4          ICE STORM  5022113500
## 5               HAIL  3025954473
## 6          HURRICANE  2741910000
## 7  HURRICANE/TYPHOON  2607872800
## 8        FLASH FLOOD  1421317100
## 9       EXTREME COLD  1292973000
## 10      FROST/FREEZE  1094086000
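This report ranks property and crop damage separately. If a single combined ranking were wanted, one option is to add the two dollar values per record before aggregating. The sketch below shows that arithmetic on made-up numbers; the TOTALDMGVAL column and the toy data frame are assumptions for illustration and are not results from the storm database.

```r
# Made-up rows, only to show the calculation
toy <- data.frame(
  EVTYPE = c("FLOOD", "DROUGHT", "FLOOD", "HAIL"),
  PROPDMGVAL = c(5e6, 0, 2e6, 1e6),
  CROPDMGVAL = c(1e6, 4e6, 0, 5e5),
  stringsAsFactors = FALSE
)

# Combined damage per record, then summed per event type
toy$TOTALDMGVAL <- toy$PROPDMGVAL + toy$CROPDMGVAL
total <- aggregate(TOTALDMGVAL ~ EVTYPE, data = toy, sum)

# Sort in descending order of combined damage
ordered_total <- total[order(-total$TOTALDMGVAL), ]
ordered_total
```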

Now, we plot our findings:

par(mfrow = c(1,2), mar = c(11, 5, 3, 2), mgp = c(3,1,0), cex = 0.8)
barplot(t10p$PROPDMGVAL/(10^9), names.arg = t10p$EVTYPE, col = "red",
        las = 3,
        ylab = "Property Damage (Billion USD)", 
        main = "Top 10 Property Damages")
barplot(t10c$CROPDMGVAL/(10^9), names.arg = t10c$EVTYPE, col = "yellow", 
        las = 3,
        ylab = "Crop Damage (Billion USD)", 
        main = "Top 10 Crop Damages")

Figure 2: Top 10 events causing the highest economic damage.

Results

  1. Tornadoes have caused the highest number of fatalities (5633) and injuries (91346), followed by excessive heat for fatalities (1903) and thunderstorm winds (TSTM WIND) for injuries (6957).

  2. Floods have caused the highest property damage (~144.7 billion USD), while droughts have caused the highest crop damage (~14.0 billion USD).