Storms and other severe weather events can cause both public health and economic problems for communities and municipalities. Many severe events result in fatalities, injuries, and property damage, and preventing such outcomes to the extent possible is a key concern. Here I present a brief analysis of the NOAA storm database to find the top 10 weather events that cause the greatest damage to population health (fatalities and injuries) and to the economy (property damage and crop damage). This report has four parts. The first part covers data processing: restructuring the dataset, subsetting to the variables of interest, and transforming the data into the right format. This step is important for the analyses that follow. The second and third parts identify the top 10 most harmful weather events for population health and for the economy, respectively. The last part discusses the analyses and suggests ways to improve them.
library(dplyr)
Read the data directly from the .bz2 file, then check the dimensions of the data and the variables in the dataset.
dt <- read.csv("repdata%2Fdata%2FStormData.csv.bz2", header = TRUE, sep = ",")
dim(dt)
## [1] 902297 37
names(dt)
## [1] "STATE__" "BGN_DATE" "BGN_TIME" "TIME_ZONE" "COUNTY"
## [6] "COUNTYNAME" "STATE" "EVTYPE" "BGN_RANGE" "BGN_AZI"
## [11] "BGN_LOCATI" "END_DATE" "END_TIME" "COUNTY_END" "COUNTYENDN"
## [16] "END_RANGE" "END_AZI" "END_LOCATI" "LENGTH" "WIDTH"
## [21] "F" "MAG" "FATALITIES" "INJURIES" "PROPDMG"
## [26] "PROPDMGEXP" "CROPDMG" "CROPDMGEXP" "WFO" "STATEOFFIC"
## [31] "ZONENAMES" "LATITUDE" "LONGITUDE" "LATITUDE_E" "LONGITUDE_"
## [36] "REMARKS" "REFNUM"
I first select only the variables relevant to the analysis of population health and economic damage.
dt1 <- subset(dt, select = c(EVTYPE, BGN_DATE, FATALITIES, INJURIES,PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP ))
As mentioned in the exercise and the National Weather Service instructions, there are many typos in EVTYPE, PROPDMGEXP, and CROPDMGEXP. In addition, officially there should be only 48 weather event types. Reducing the data to those 48 official event types would take a considerable amount of work. Within the scope of this exercise, I use the toupper function to remove differences in letter case, which reduces the number of variant spellings in the dataset.
dt1$EVTYPE <- toupper(dt1$EVTYPE)
dt1$PROPDMGEXP <- toupper(dt1$PROPDMGEXP)
dt1$CROPDMGEXP <- toupper(dt1$CROPDMGEXP)
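Upper-casing removes only case differences. As a minimal sketch of how the cleanup could be taken a step further, the function below also trims whitespace and expands one common abbreviation. The function name `normalize_evtype`, the toy input vector, and the specific replacement rules are assumptions made for illustration; they are not part of the analysis in this report and do not produce the 48 official event types.

```r
# Toy vector standing in for dt1$EVTYPE (values invented for illustration).
evtype <- c("  TSTM WIND", "Thunderstorm Winds", "Flash  Flood ", "HAIL")

# Hypothetical helper: a few conservative normalization rules.
normalize_evtype <- function(x) {
  x <- toupper(trimws(x))                # unify case, drop leading/trailing spaces
  x <- gsub("\\s+", " ", x)              # collapse repeated whitespace
  x <- gsub("^TSTM", "THUNDERSTORM", x)  # expand a common abbreviation
  x <- gsub("WINDS$", "WIND", x)         # singularize a trailing "WINDS"
  x
}

cleaned <- normalize_evtype(evtype)
cleaned
```

Rules like these merge labels such as TSTM WIND and THUNDERSTORM WIND, so applying them would change the per-event totals and possibly the rankings.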
Next, I convert the character strings in BGN_DATE to Date objects for later calculation.
# %H:%M:%S = hours, minutes, seconds
dt1$BGN_DATE <- as.Date(dt1$BGN_DATE, "%m/%d/%Y %H:%M:%S")
According to NOAA, data recording started in Jan. 1950. At that time only one event type, tornado, was recorded. More event types were added gradually, and only from Jan. 1996 onward were all event types recorded. Since the objective is to compare the effects of different weather events, I exclude all records dated earlier than Jan. 1996.
dt2 <- subset(dt1, as.numeric(format(dt1$BGN_DATE, format = "%Y")) >=1996)
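To make the date handling concrete, here is a small sketch on toy strings written in the same month/day/year plus time layout as BGN_DATE. The variable names `raw_dates`, `parsed`, and `keep` and the sample values are illustrative assumptions; the comparison against a cutoff date is an alternative to extracting the year, not the method used above.

```r
# Toy strings in the same layout as BGN_DATE (values invented for illustration).
raw_dates <- c("4/18/1950 0:00:00", "1/6/1996 0:00:00", "11/28/2011 0:00:00")

# %m/%d/%Y is the date part; %H:%M:%S is hours, minutes, seconds.
parsed <- as.Date(raw_dates, format = "%m/%d/%Y %H:%M:%S")

# Keep only events that began on or after 1 Jan 1996.
keep <- parsed >= as.Date("1996-01-01")
parsed[keep]
```

Comparing Date objects directly avoids the round trip through format() and as.numeric().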
The final dataset which was used for analyses looks like this:
head(dt2)
## EVTYPE BGN_DATE FATALITIES INJURIES PROPDMG PROPDMGEXP
## 248768 WINTER STORM 1996-01-06 0 0 380 K
## 248769 TORNADO 1996-01-11 0 0 100 K
## 248770 TSTM WIND 1996-01-11 0 0 3 K
## 248771 TSTM WIND 1996-01-11 0 0 5 K
## 248772 TSTM WIND 1996-01-11 0 0 2 K
## 248773 HAIL 1996-01-18 0 0 0
## CROPDMG CROPDMGEXP
## 248768 38 K
## 248769 0
## 248770 0
## 248771 0
## 248772 0
## 248773 0
Select only the event type and the fatality counts, and remove the rows with FATALITIES = 0.
dt_fatal <- dt2[!dt2$FATALITIES ==0, c(1,3) ]
# Sum the fatalities for each type of event
fatalities <- aggregate(FATALITIES ~ EVTYPE, dt_fatal, sum)
# rank the data with decreasing FATALITIES
fatalities <- arrange(fatalities, -FATALITIES)
# select the top 10 fatalities events.
top10_fatalities<- head(fatalities, 10)
According to my analysis, the top 10 weather events that cause fatalities are:
top10_fatalities
## EVTYPE FATALITIES
## 1 EXCESSIVE HEAT 1797
## 2 TORNADO 1511
## 3 FLASH FLOOD 887
## 4 LIGHTNING 651
## 5 FLOOD 414
## 6 RIP CURRENT 340
## 7 TSTM WIND 241
## 8 HEAT 237
## 9 HIGH WIND 235
## 10 AVALANCHE 223
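The same subset, aggregate, rank, and truncate pattern is used for every outcome in this report. As a sketch of how it could be factored out, the helper below wraps those four steps in base R. The name `top_events`, its arguments, and the toy data frame are hypothetical and only illustrate the pattern; they are not the code that produced the tables in this report.

```r
# Hypothetical helper: total one numeric column per event type and
# return the n event types with the largest totals.
top_events <- function(data, column, n = 10) {
  # drop rows where the outcome is zero and keep only the needed columns
  nonzero <- data[data[[column]] != 0, c("EVTYPE", column)]
  # sum the outcome within each event type
  totals <- aggregate(nonzero[[column]], by = list(EVTYPE = nonzero$EVTYPE), FUN = sum)
  names(totals)[2] <- column
  # rank in decreasing order and keep the first n rows
  totals <- totals[order(-totals[[column]]), ]
  rownames(totals) <- NULL
  head(totals, n)
}

# Toy data (values invented for illustration).
toy <- data.frame(
  EVTYPE     = c("TORNADO", "FLOOD", "TORNADO", "HAIL", "FLOOD", "HEAT"),
  FATALITIES = c(3, 1, 2, 0, 3, 6),
  stringsAsFactors = FALSE
)

top2 <- top_events(toy, "FATALITIES", n = 2)
top2
```

With such a helper, a call like `top_events(dt2, "INJURIES")` would follow the same steps for another outcome column.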
Select only the event type and the injury counts, and remove the rows with INJURIES = 0.
dt_injuries <- dt2[!dt2$INJURIES ==0, c(1,4)]
# Sum the injuries for each type of event
injuries <- aggregate(INJURIES ~ EVTYPE, dt_injuries, sum)
# rank the data with decreasing INJURIES
injuries <- arrange(injuries, -INJURIES)
# select the top 10 injuries events.
top10_injuries<- head(injuries, 10)
According to my analysis, the top 10 weather events that cause injuries are:
top10_injuries
## EVTYPE INJURIES
## 1 TORNADO 20667
## 2 FLOOD 6758
## 3 EXCESSIVE HEAT 6391
## 4 LIGHTNING 4141
## 5 TSTM WIND 3629
## 6 FLASH FLOOD 1674
## 7 THUNDERSTORM WIND 1400
## 8 WINTER STORM 1292
## 9 HURRICANE/TYPHOON 1275
## 10 HEAT 1222
# creating figure 1 for the report
par(mfrow = c(1,2), mar = c(11, 3, 4,2))
# plot the top 10 harmful events of fatalities
barplot(top10_fatalities$FATALITIES, names.arg = top10_fatalities$EVTYPE, las = 2, col = "red", main = "Top 10 most harmful events\n of FATALITIES")
# plot the top 10 harmful events of injuries
barplot(top10_injuries$INJURIES, names.arg = top10_injuries$EVTYPE, las = 2, col = "blue", main = "Top 10 most harmful events\n of INJURIES")
Figure 1: The top 10 most harmful events with respect to population health: fatalities (left) and injuries (right). The y-axis shows the number of fatalities or injuries.
Estimate the property damage for each event type. The final result shows the top 10 totals of property damage in US dollars.
# Subset to have the wanted data and remove the damaged property = 0
dt_PROP <- dt2[!dt2$PROPDMG == 0, c(1, 5:6)]
#head(dt_PROP)
Check the distinct values of PROPDMGEXP:
unique(dt_PROP$PROPDMGEXP)
## [1] "K" "M" "B"
Here I calculate the total property damage in US dollars for each weather event type.
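# Index both sides with the same condition so each row is scaled by its own PROPDMG value
dt_PROP$damaged_in_dollars[dt_PROP$PROPDMGEXP == "K"] <- dt_PROP$PROPDMG[dt_PROP$PROPDMGEXP == "K"]*(10^3)
dt_PROP$damaged_in_dollars[dt_PROP$PROPDMGEXP == "M"] <- dt_PROP$PROPDMG[dt_PROP$PROPDMGEXP == "M"]*(10^6)
dt_PROP$damaged_in_dollars[dt_PROP$PROPDMGEXP == "B"] <- dt_PROP$PROPDMG[dt_PROP$PROPDMGEXP == "B"]*(10^9)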
# Sum the property damage in dollars for each type of event
PROP <- aggregate(damaged_in_dollars ~ EVTYPE, dt_PROP, sum)
# rank with decreasing property damage values
PROP <- arrange(PROP, -damaged_in_dollars)
# select the top 10 property damage events.
top10_PROP<- head(PROP, 10)
According to my analysis, the top 10 weather events that cause monetary loss through property damage are:
top10_PROP
## EVTYPE damaged_in_dollars
## 1 STORM SURGE/TIDE 596002089200
## 2 FLOOD 478679649940
## 3 FLASH FLOOD 470363291660
## 4 TORNADO 285990100060
## 5 HURRICANE 256042180810
## 6 HURRICANE/TYPHOON 141966285180
## 7 HIGH WIND 89405954190
## 8 HAIL 42674775470
## 9 TSTM WIND 23538719340
## 10 WILDFIRE 17558931810
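The K/M/B scaling can also be written with a named lookup vector, which keeps every damage value paired with the exponent code from its own row. The sketch below uses a toy data frame; the name `multiplier` and the sample values are assumptions for illustration, and only the three codes observed above (K, M, B) are covered.

```r
# Toy data (values invented for illustration).
toy <- data.frame(
  PROPDMG    = c(25, 2.5, 1, 10),
  PROPDMGEXP = c("K", "M", "B", "K"),
  stringsAsFactors = FALSE
)

# Named lookup vector: exponent code -> multiplier.
multiplier <- c(K = 1e3, M = 1e6, B = 1e9)

# Indexing by name returns one multiplier per row, in row order,
# so each PROPDMG value is scaled by the exponent from its own row.
toy$damaged_in_dollars <- toy$PROPDMG * unname(multiplier[toy$PROPDMGEXP])
toy
```

Any code missing from the lookup vector yields NA rather than a silently wrong value, which makes unexpected exponent codes easy to spot.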
Next, I estimate the crop damage for each weather event type in the same way.
# Subset to have the wanted data and remove the damaged CROP = 0
dt_CROP <- dt2[!dt2$CROPDMG == 0, c(1, 7:8)]
Check the distinct values of CROPDMGEXP:
unique(dt_CROP$CROPDMGEXP)
## [1] "K" "M" "B"
Here I calculate the total crop damage in US dollars for each weather event type.
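# Index both sides with the same condition so each row is scaled by its own CROPDMG value
dt_CROP$damaged_in_dollars[dt_CROP$CROPDMGEXP == "K"] <- dt_CROP$CROPDMG[dt_CROP$CROPDMGEXP == "K"]*(10^3)
dt_CROP$damaged_in_dollars[dt_CROP$CROPDMGEXP == "M"] <- dt_CROP$CROPDMG[dt_CROP$CROPDMGEXP == "M"]*(10^6)
dt_CROP$damaged_in_dollars[dt_CROP$CROPDMGEXP == "B"] <- dt_CROP$CROPDMG[dt_CROP$CROPDMGEXP == "B"]*(10^9)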
# Sum the crop damage in dollars for each type of event
CROP <- aggregate(damaged_in_dollars ~ EVTYPE, dt_CROP, sum)
# rank with decreasing crop damage values
CROP <- arrange(CROP, -damaged_in_dollars)
# select the top 10 crop damage events.
top10_CROP<- head(CROP, 10)
According to my analysis, the top 10 weather events that cause monetary loss through crop damage are:
top10_CROP
## EVTYPE damaged_in_dollars
## 1 HURRICANE/TYPHOON 38579068000
## 2 HAIL 36165386760
## 3 FLOOD 26039891490
## 4 FLASH FLOOD 11147480670
## 5 THUNDERSTORM WIND 8859341490
## 6 DROUGHT 8183683000
## 7 TSTM WIND 6290490120
## 8 FROST/FREEZE 5779211600
## 9 TORNADO 4429597790
## 10 HEAVY RAIN 2231708700
par(mfrow = c(1,2), mar = c(11, 5, 4, 2))
# plot the top 10 harmful events of damaged properties
barplot(top10_PROP$damaged_in_dollars, names.arg = top10_PROP$EVTYPE, las = 2, col = "red", main = "Top 10 harmful events\n of properties damage")
# plot the top 10 harmful events of damaged crops
barplot(top10_CROP$damaged_in_dollars, names.arg = top10_CROP$EVTYPE, las = 2, col = "blue", main = "Top 10 harmful events\n of crops damage")
Figure 2: The top 10 most harmful events with respect to property damage (left) and crop damage (right). The y-axis shows the total damage in US dollars.
This report shows a brief analysis of fatalities, injuries, property damage, and crop damage in the United States using NOAA storm data. According to NOAA, data recording started in Jan. 1950 with a single event type, tornado. More event types were added gradually, and only from Jan. 1996 onward were all event types recorded. Since the objective is to compare the effects of different weather events, I used only data recorded from Jan. 1996 onward. In addition, there are officially 48 event types. However, applying the ‘unique’ function to the ‘EVTYPE’ column returns nearly one thousand distinct values, largely because of typos and inconsistent labels. In this analysis I did not correct the typos in the event types beyond converting them to upper case, so some events are split across several labels (for example, TSTM WIND and THUNDERSTORM WIND appear as separate rows in the injuries table). As a final remark, this analysis does not account for inflation or data rectification. Adjusting the costs for inflation would make the analysis more accurate. Data rectification and error detection would also help: the errors are not limited to event types, and some of the crop and property damage figures are recorded incorrectly. Checking the costs (at least the major ones) against another source would make the results more reliable. Any such correction should keep the analysis reproducible.