Synopsis

Storms and other severe weather events can cause both public health and economic problems for communities and municipalities. Many severe events can result in fatalities, injuries, and property damage, and preventing such outcomes to the extent possible is a key concern. Here I present a brief analysis of the NOAA storm database to find the top 10 weather events that cause the greatest damage to population health (fatalities and injuries) and to the economy (property damage and crop damage). This report has four parts. The first part covers data processing: restructuring the dataset, subsetting it to the variables of interest, and transforming the data into the right format. This step is important for the later analyses. The second and third parts identify the 10 weather events most harmful to population health and to the economy, respectively. In the last part, I discuss the analysis and suggest ways to improve it.

Loading libraries

library(dplyr)

Data Processing

Read the data directly from the .bz2 file, then check the dimensions of the data and the variables in the dataset.

dt <- read.csv("repdata%2Fdata%2FStormData.csv.bz2", header = TRUE, sep = ",")
dim(dt)
## [1] 902297     37
names(dt)
##  [1] "STATE__"    "BGN_DATE"   "BGN_TIME"   "TIME_ZONE"  "COUNTY"    
##  [6] "COUNTYNAME" "STATE"      "EVTYPE"     "BGN_RANGE"  "BGN_AZI"   
## [11] "BGN_LOCATI" "END_DATE"   "END_TIME"   "COUNTY_END" "COUNTYENDN"
## [16] "END_RANGE"  "END_AZI"    "END_LOCATI" "LENGTH"     "WIDTH"     
## [21] "F"          "MAG"        "FATALITIES" "INJURIES"   "PROPDMG"   
## [26] "PROPDMGEXP" "CROPDMG"    "CROPDMGEXP" "WFO"        "STATEOFFIC"
## [31] "ZONENAMES"  "LATITUDE"   "LONGITUDE"  "LATITUDE_E" "LONGITUDE_"
## [36] "REMARKS"    "REFNUM"
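read.csv() can open the compressed file directly because R's file connections detect bzip2 compression when reading. The following is a minimal sketch of that behaviour using a tiny throwaway file; the two data rows are made up for illustration and are not taken from the NOAA data.

```r
# Write a two-row CSV through a bzip2 connection, then read it back
# with read.csv() in the same way as the real storm file.
tmp <- tempfile(fileext = ".csv.bz2")
con <- bzfile(tmp, open = "w")
writeLines(c("EVTYPE,FATALITIES", "TORNADO,1", "HAIL,0"), con)
close(con)

toy <- read.csv(tmp)   # no manual decompression step is needed
dim(toy)               # rows and columns, as checked above for dt
unlink(tmp)            # remove the temporary file
```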

Select only the wanted data for analysis

I first select only the variables that are relevant to analyzing population health and economic damage.

dt1 <- subset(dt, select = c(EVTYPE, BGN_DATE, FATALITIES, INJURIES, PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP))

As mentioned in the exercise and the National Weather Service instructions, there are many typos in EVTYPE, PROPDMGEXP, and CROPDMGEXP. In addition, officially there should be only 48 weather event types. Reducing the data to those 48 official event types would take a lot of work. Within the limits of this exercise, I used the toupper function to reduce the number of spelling variants in the dataset.

dt1$EVTYPE <- toupper(dt1$EVTYPE)
dt1$PROPDMGEXP <- toupper(dt1$PROPDMGEXP)
dt1$CROPDMGEXP <- toupper(dt1$CROPDMGEXP)
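As a small illustration of what upper-casing achieves, the sketch below counts distinct labels before and after normalization. The labels are made up for the example, and the additional trimws() call is an assumption on my part; it is not part of the processing above.

```r
# Made-up labels that differ only in letter case or surrounding spaces
labels <- c("Tstm Wind", "TSTM WIND", " tstm wind", "Hail", "HAIL")

length(unique(labels))               # 5 distinct raw labels

# Upper-case the labels; trimws() also drops leading/trailing spaces
cleaned <- toupper(trimws(labels))
unique(cleaned)                      # "TSTM WIND" "HAIL"
```

Upper-casing only merges labels that differ in case. Genuinely different spellings such as TSTM WIND and THUNDERSTORM WIND remain separate.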

Then I convert the character strings in BGN_DATE to Date values for later calculations.

dt1$BGN_DATE <- as.Date(dt1$BGN_DATE, "%m/%d/%Y %H:%M:%S")
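The conversion can be checked on a couple of strings. This sketch assumes the raw values follow the month/day/year plus time layout implied by the format string; the two timestamps are examples, not rows quoted from the file. In the format, %m, %d and %Y are month, day and four-digit year, and %H, %M and %S are hours, minutes and seconds.

```r
# Example timestamps in the assumed month/day/year layout
raw <- c("1/6/1996 0:00:00", "12/31/2011 0:00:00")

parsed <- as.Date(raw, format = "%m/%d/%Y %H:%M:%S")
parsed                  # "1996-01-06" "2011-12-31"
format(parsed, "%Y")    # "1996" "2011"
```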

According to NOAA, data recording started in January 1950. At that time only one event type, tornado, was recorded. More event types were added gradually, and only from January 1996 onward were all event types recorded. Since the objective is to compare the effects of different weather events, I exclude all records dated earlier than January 1996.

dt2 <- subset(dt1, as.numeric(format(dt1$BGN_DATE, format = "%Y")) >=1996)
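Because BGN_DATE is a Date at this point, the same filter can also be written as a direct comparison against a cutoff date. A minimal sketch on made-up rows (the `events`, `cutoff` and `recent` names are only for this example):

```r
# Three made-up events around the 1996 cutoff
events <- data.frame(
  EVTYPE   = c("TORNADO", "HAIL", "FLOOD"),
  BGN_DATE = as.Date(c("1995-12-31", "1996-01-01", "2005-08-29")),
  stringsAsFactors = FALSE
)

cutoff <- as.Date("1996-01-01")
recent <- events[events$BGN_DATE >= cutoff, ]   # keep 1996 and later
recent$EVTYPE                                   # "HAIL" "FLOOD"
```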

The final dataset used for the analyses looks like this:

head(dt2)
##              EVTYPE   BGN_DATE FATALITIES INJURIES PROPDMG PROPDMGEXP
## 248768 WINTER STORM 1996-01-06          0        0     380          K
## 248769      TORNADO 1996-01-11          0        0     100          K
## 248770    TSTM WIND 1996-01-11          0        0       3          K
## 248771    TSTM WIND 1996-01-11          0        0       5          K
## 248772    TSTM WIND 1996-01-11          0        0       2          K
## 248773         HAIL 1996-01-18          0        0       0           
##        CROPDMG CROPDMGEXP
## 248768      38          K
## 248769       0           
## 248770       0           
## 248771       0           
## 248772       0           
## 248773       0

Results

1. Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?

Subset the data to the event type and the number of fatalities, removing the rows with FATALITIES = 0.

dt_fatal <- dt2[!dt2$FATALITIES ==0, c(1,3) ]
# Sum the fatalities for each type of event
fatalities <- aggregate(FATALITIES ~ EVTYPE, dt_fatal, sum)
# rank the data with decreasing FATALITIES
fatalities <- arrange(fatalities, -FATALITIES)
# select the top 10 fatalities events.
top10_fatalities<- head(fatalities, 10)

According to my analysis, the top 10 weather events by number of fatalities are:

top10_fatalities
##            EVTYPE FATALITIES
## 1  EXCESSIVE HEAT       1797
## 2         TORNADO       1511
## 3     FLASH FLOOD        887
## 4       LIGHTNING        651
## 5           FLOOD        414
## 6     RIP CURRENT        340
## 7       TSTM WIND        241
## 8            HEAT        237
## 9       HIGH WIND        235
## 10      AVALANCHE        223
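The sum, rank and select steps are the same for every outcome in this report, so they could be wrapped in one helper. The sketch below uses base R only; `top_events` is a hypothetical name introduced here, and the toy data are made up.

```r
# Hypothetical helper: total one numeric column per event type,
# sort in decreasing order and return the n largest totals.
top_events <- function(data, column, n = 10) {
  totals <- aggregate(data[[column]], by = list(EVTYPE = data$EVTYPE), FUN = sum)
  names(totals)[2] <- column
  totals <- totals[order(-totals[[column]]), ]
  rownames(totals) <- NULL
  head(totals, n)
}

toy <- data.frame(
  EVTYPE     = c("TORNADO", "FLOOD", "TORNADO", "HEAT", "FLOOD"),
  FATALITIES = c(2, 1, 3, 4, 0),
  stringsAsFactors = FALSE
)
res <- top_events(toy, "FATALITIES", n = 2)
res   # TORNADO 5, then HEAT 4
```

Rows with a value of zero do not change a sum, so removing them beforehand only affects which event types appear at the bottom of the full ranking, not the top 10.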

Subset the data to the event type and the number of injuries, removing the rows with INJURIES = 0.

dt_injuries <- dt2[!dt2$INJURIES ==0, c(1,4)]
# Sum the injuries for each type of event
injuries <- aggregate(INJURIES ~ EVTYPE, dt_injuries, sum)
# rank the data by decreasing INJURIES
injuries <- arrange(injuries, -INJURIES)
# select the top 10 events by injuries
top10_injuries<- head(injuries, 10)

According to my analysis, the top 10 weather events by number of injuries are:

top10_injuries
##               EVTYPE INJURIES
## 1            TORNADO    20667
## 2              FLOOD     6758
## 3     EXCESSIVE HEAT     6391
## 4          LIGHTNING     4141
## 5          TSTM WIND     3629
## 6        FLASH FLOOD     1674
## 7  THUNDERSTORM WIND     1400
## 8       WINTER STORM     1292
## 9  HURRICANE/TYPHOON     1275
## 10              HEAT     1222

Plotting the top 10 most harmful events with respect to population health.

# creating figure 1 for the report
par(mfrow = c(1,2), mar = c(11, 3, 4,2))
# plot the top 10 harmful events of fatalities
barplot(top10_fatalities$FATALITIES, names.arg = top10_fatalities$EVTYPE, las = 2, col = "red", main = "Top 10 most harmful events\n of FATALITIES")
# plot the top 10 harmful events of injuries
barplot(top10_injuries$INJURIES, names.arg = top10_injuries$EVTYPE, las = 2, col = "blue", main = "Top 10 most harmful events\n of INJURIES")

Figure 1: The top 10 most harmful events with respect to population health: fatalities (left) and injuries (right). The y-axis shows the number of fatalities or injuries.

2. Across the United States, which types of events have the greatest economic consequences?

Estimate the property damage for each event type. The final result shows the top 10 totals of property damage in US dollars.

# Subset to have the wanted data and remove the damaged property = 0
dt_PROP <- dt2[!dt2$PROPDMG == 0, c(1, 5:6)]
#head(dt_PROP)

Check the distinct values of PROPDMGEXP:

unique(dt_PROP$PROPDMGEXP)
## [1] "K" "M" "B"
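Only K, M and B appear here, but a small guard makes that assumption explicit before any conversion is done. The sketch below runs on a made-up vector of codes; `known` and `unexpected` are names chosen for this example.

```r
# Made-up exponent codes, including a blank and an unknown letter
exp_codes <- c("K", "M", "", "B", "H", "K")
known     <- c("K", "M", "B")

table(exp_codes, useNA = "ifany")          # frequency of each code

# Codes that the K/M/B conversion would not handle
unexpected <- setdiff(unique(exp_codes), known)
unexpected                                 # "" "H"
```

If `unexpected` is empty, every row has a code that the conversion handles.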

Here I calculate the total property damage in US dollars for each weather event type.

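# Convert each amount to US dollars. The same row mask is applied on both
# sides of each assignment so that every amount stays aligned with its own row.
dt_PROP$damaged_in_dollars <- NA_real_
dt_PROP$damaged_in_dollars[dt_PROP$PROPDMGEXP == "K"] <- dt_PROP$PROPDMG[dt_PROP$PROPDMGEXP == "K"]*(10^3)
dt_PROP$damaged_in_dollars[dt_PROP$PROPDMGEXP == "M"] <- dt_PROP$PROPDMG[dt_PROP$PROPDMGEXP == "M"]*(10^6)
dt_PROP$damaged_in_dollars[dt_PROP$PROPDMGEXP == "B"] <- dt_PROP$PROPDMG[dt_PROP$PROPDMGEXP == "B"]*(10^9)
# Sum the property damage in dollars for each type of event
PROP <- aggregate(damaged_in_dollars ~ EVTYPE, dt_PROP, sum)
# rank by decreasing property damage
PROP <- arrange(PROP, -damaged_in_dollars)
# select the top 10 events by property damage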
top10_PROP<- head(PROP, 10)
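The K, M and B codes stand for thousands, millions and billions, so the dollar amount is the reported number times the matching power of ten. One way to keep every amount paired with its own row is a named lookup vector, sketched below on made-up rows. `multiplier` is a name introduced for this example, and codes missing from it would produce NA.

```r
# Named lookup: exponent code -> multiplier
multiplier <- c(K = 1e3, M = 1e6, B = 1e9)

toy <- data.frame(
  PROPDMG    = c(380, 2.5, 1, 5),
  PROPDMGEXP = c("K", "M", "B", "K"),
  stringsAsFactors = FALSE
)

# Look up the multiplier row by row, then scale the reported amount
toy$damaged_in_dollars <- toy$PROPDMG * unname(multiplier[toy$PROPDMGEXP])
toy$damaged_in_dollars   # 380000, 2500000, 1e+09, 5000
```

The lookup works on whole columns at once, so no per-code subsetting is needed and the same line applies to the crop columns.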

According to my analysis, the top 10 weather events by total property damage are:

top10_PROP
##               EVTYPE damaged_in_dollars
## 1   STORM SURGE/TIDE       596002089200
## 2              FLOOD       478679649940
## 3        FLASH FLOOD       470363291660
## 4            TORNADO       285990100060
## 5          HURRICANE       256042180810
## 6  HURRICANE/TYPHOON       141966285180
## 7          HIGH WIND        89405954190
## 8               HAIL        42674775470
## 9          TSTM WIND        23538719340
## 10          WILDFIRE        17558931810

Next, I estimate the crop damage for each event type.

# Subset to have the wanted data and remove the damaged CROP = 0
dt_CROP <- dt2[!dt2$CROPDMG == 0, c(1, 7:8)]

Check the distinct values of CROPDMGEXP:

unique(dt_CROP$CROPDMGEXP)
## [1] "K" "M" "B"

Here I calculate the total crop damage in US dollars for each weather event type.

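# Convert each amount to US dollars. The same row mask is applied on both
# sides of each assignment so that every amount stays aligned with its own row.
dt_CROP$damaged_in_dollars <- NA_real_
dt_CROP$damaged_in_dollars[dt_CROP$CROPDMGEXP == "K"] <- dt_CROP$CROPDMG[dt_CROP$CROPDMGEXP == "K"]*(10^3)
dt_CROP$damaged_in_dollars[dt_CROP$CROPDMGEXP == "M"] <- dt_CROP$CROPDMG[dt_CROP$CROPDMGEXP == "M"]*(10^6)
dt_CROP$damaged_in_dollars[dt_CROP$CROPDMGEXP == "B"] <- dt_CROP$CROPDMG[dt_CROP$CROPDMGEXP == "B"]*(10^9)

# Sum the crop damage in dollars for each type of event
CROP <- aggregate(damaged_in_dollars ~ EVTYPE, dt_CROP, sum)
# rank by decreasing crop damage
CROP <- arrange(CROP, -damaged_in_dollars)
# select the top 10 events by crop damage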
top10_CROP<- head(CROP, 10)

According to my analysis, the top 10 weather events by total crop damage are:

top10_CROP
##               EVTYPE damaged_in_dollars
## 1  HURRICANE/TYPHOON        38579068000
## 2               HAIL        36165386760
## 3              FLOOD        26039891490
## 4        FLASH FLOOD        11147480670
## 5  THUNDERSTORM WIND         8859341490
## 6            DROUGHT         8183683000
## 7          TSTM WIND         6290490120
## 8       FROST/FREEZE         5779211600
## 9            TORNADO         4429597790
## 10        HEAVY RAIN         2231708700

Plotting the top 10 most harmful events with respect to property damage and crop damage.

par(mfrow = c(1,2), mar = c(11, 5, 4, 2))
# plot the top 10 harmful events by property damage
barplot(top10_PROP$damaged_in_dollars, names.arg = top10_PROP$EVTYPE, las = 2, col = "red", main = "Top 10 harmful events\n by property damage")
# plot the top 10 harmful events by crop damage
barplot(top10_CROP$damaged_in_dollars, names.arg = top10_CROP$EVTYPE, las = 2, col = "blue", main = "Top 10 harmful events\n by crop damage")

Figure 2: The top 10 most harmful events with respect to property damage (left) and crop damage (right). The y-axis shows the total damage in US dollars.

Discussions

This report presents a brief analysis of the fatalities, injuries, property damage, and crop damage in the United States using NOAA storm data. According to NOAA, data recording started in January 1950, when only one event type, tornado, was recorded. More event types were added gradually, and only from January 1996 onward were all event types recorded. Since the objective is to compare the effects of different weather events, I used only data recorded from January 1996 onward. In addition, there are officially 48 event types. However, applying the ‘unique’ function to the ‘EVTYPE’ column returns nearly one thousand distinct values, most of which are typos or spelling variants. In this analysis I did not correct the event types beyond converting them to upper case. As a final remark, this analysis does not account for inflation or data rectification. Adjusting the costs for inflation would make the analysis more accurate, and data rectification and error detection would also be worthwhile. The typos are not limited to event types: some of the crop and property damage amounts are recorded incorrectly. Finding another source to check the validity of the costs (at least the major ones) would make the work more accurate. Any such correction must keep the work reproducible.
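As a starting point for the event type cleanup mentioned above, the sketch below merges two pairs of labels that appear as separate rows in the tables of this report. `consolidate_evtype` is a hypothetical helper, and the target labels are my assumption about the official names; a full cleanup would need a mapping for all 48 official types.

```r
# Hypothetical helper: merge a few known spelling variants of EVTYPE
consolidate_evtype <- function(evtype) {
  evtype <- toupper(trimws(evtype))
  # "TSTM" is an abbreviation of "THUNDERSTORM"
  evtype <- sub("^TSTM", "THUNDERSTORM", evtype)
  # Assumed official label for both hurricane variants
  evtype[evtype %in% c("HURRICANE", "HURRICANE/TYPHOON")] <- "HURRICANE (TYPHOON)"
  evtype
}

merged <- consolidate_evtype(c("TSTM WIND", "Thunderstorm Wind",
                               "HURRICANE", "HURRICANE/TYPHOON", "HAIL"))
unique(merged)   # "THUNDERSTORM WIND" "HURRICANE (TYPHOON)" "HAIL"
```

Keeping the mapping in code rather than editing the data by hand preserves reproducibility, since the same rules are applied every time the report is run.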