This project uses data from the U.S. National Oceanic and Atmospheric Administration's (NOAA) Storm Database, collected in the USA over the years 1950-2011. The database records the effects of different weather events on human fatalities and injuries, as well as the property and crop damage caused by these events [1, 2]. We performed a basic exploratory analysis of these data. In particular, the project attempted to answer the following questions: (i) which types of events were most harmful with respect to population health? (ii) which types of events had the greatest economic consequences?
We found that, in the US during the years 1996-2011, tornadoes, heat, and floods were the most harmful events for public health. We also found that floods, hurricanes, and storm surges had the largest negative economic impact.
The data were obtained from the NOAA Storm Database as a bz2-compressed file [3].
Loading the packages used in the analysis:
library(dplyr)
library(tidyr)
library(ggplot2)
Reading raw data:
raw_data <- read.csv("storm_data.csv.bz2", na.strings = c("NA",""))
dim(raw_data)
## [1] 902297 37
There are 902297 rows and 37 columns in this dataset.
The dataset contains the following variables:
names(raw_data)
## [1] "STATE__" "BGN_DATE" "BGN_TIME" "TIME_ZONE" "COUNTY"
## [6] "COUNTYNAME" "STATE" "EVTYPE" "BGN_RANGE" "BGN_AZI"
## [11] "BGN_LOCATI" "END_DATE" "END_TIME" "COUNTY_END" "COUNTYENDN"
## [16] "END_RANGE" "END_AZI" "END_LOCATI" "LENGTH" "WIDTH"
## [21] "F" "MAG" "FATALITIES" "INJURIES" "PROPDMG"
## [26] "PROPDMGEXP" "CROPDMG" "CROPDMGEXP" "WFO" "STATEOFFIC"
## [31] "ZONENAMES" "LATITUDE" "LONGITUDE" "LATITUDE_E" "LONGITUDE_"
## [36] "REMARKS" "REFNUM"
Subsetting the data to keep only the variables required for our analysis, according to the NOAA codebook [2]:
data <- subset(raw_data, select = c("BGN_DATE","EVTYPE", "INJURIES",
"FATALITIES", "CROPDMG", "CROPDMGEXP",
"PROPDMG", "PROPDMGEXP"))
names(data)
## [1] "BGN_DATE" "EVTYPE" "INJURIES" "FATALITIES" "CROPDMG"
## [6] "CROPDMGEXP" "PROPDMG" "PROPDMGEXP"
str(data)
## 'data.frame': 902297 obs. of 8 variables:
## $ BGN_DATE : Factor w/ 16335 levels "1/1/1966 0:00:00",..: 6523 6523 4242 11116 2224 2224 2260 383 3980 3980 ...
## $ EVTYPE : Factor w/ 985 levels " HIGH SURF ADVISORY",..: 834 834 834 834 834 834 834 834 834 834 ...
## $ INJURIES : num 15 0 2 2 2 6 1 0 14 0 ...
## $ FATALITIES: num 0 0 0 0 0 0 0 0 1 0 ...
## $ CROPDMG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CROPDMGEXP: Factor w/ 8 levels "?","0","2","B",..: NA NA NA NA NA NA NA NA NA NA ...
## $ PROPDMG : num 25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
## $ PROPDMGEXP: Factor w/ 18 levels "-","?","+","0",..: 16 16 16 16 16 16 16 16 16 16 ...
summary(data)
## BGN_DATE EVTYPE INJURIES
## 5/25/2011 0:00:00: 1202 HAIL :288661 Min. : 0.0000
## 4/27/2011 0:00:00: 1193 TSTM WIND :219940 1st Qu.: 0.0000
## 6/9/2011 0:00:00 : 1030 THUNDERSTORM WIND: 82563 Median : 0.0000
## 5/30/2004 0:00:00: 1016 TORNADO : 60652 Mean : 0.1557
## 4/4/2011 0:00:00 : 1009 FLASH FLOOD : 54277 3rd Qu.: 0.0000
## 4/2/2006 0:00:00 : 981 FLOOD : 25326 Max. :1700.0000
## (Other) :895866 (Other) :170878
## FATALITIES CROPDMG CROPDMGEXP PROPDMG
## Min. : 0.0000 Min. : 0.000 K :281832 Min. : 0.00
## 1st Qu.: 0.0000 1st Qu.: 0.000 M : 1994 1st Qu.: 0.00
## Median : 0.0000 Median : 0.000 k : 21 Median : 0.00
## Mean : 0.0168 Mean : 1.527 0 : 19 Mean : 12.06
## 3rd Qu.: 0.0000 3rd Qu.: 0.000 B : 9 3rd Qu.: 0.50
## Max. :583.0000 Max. :990.000 (Other): 9 Max. :5000.00
## NA's :618413
## PROPDMGEXP
## K :424665
## M : 11330
## 0 : 216
## B : 40
## 5 : 28
## (Other): 84
## NA's :465934
unique(data$CROPDMGEXP)
## [1] <NA> M K m B ? 0 k 2
## Levels: ? 0 2 B k K m M
unique(data$PROPDMGEXP)
## [1] K M <NA> B m + 0 5 6 ? 4 2 3 h
## [15] 7 H - 1 8
## Levels: - ? + 0 1 2 3 4 5 6 7 8 B h H K m M
According to the NOAA codebook, the variables CROPDMGEXP and PROPDMGEXP scale the dollar amounts in the columns CROPDMG and PROPDMG, respectively, to billions (B), millions (M), or thousands (K). The CROPDMGEXP and PROPDMGEXP columns also contain some characters (such as -, ?, +, 0-9) that are not described in the codebook and are most probably data entry mistakes. There are also many NAs in CROPDMGEXP and PROPDMGEXP, which we take to mean that the corresponding values of CROPDMG and PROPDMG do not need an adjustment. We will therefore scale only the entries marked “B/b”, “M/m” or “K/k” when calculating the dollar amounts of CROPDMG and PROPDMG; all other entries are left unscaled.
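# Column 9: power of ten for crop damage (0 means "no scaling")
data[,9] <- 0
data[,6] <- as.character(data[,6])  # CROPDMGEXP as character for pattern matching
data[grep("B|b", data$CROPDMGEXP), 9] <- 9  # billions
data[grep("M|m", data$CROPDMGEXP), 9] <- 6  # millions
data[grep("K|k", data$CROPDMGEXP), 9] <- 3  # thousands
data[,5] <- data$CROPDMG*(10^data[,9])  # CROPDMG in dollars
# Column 10: power of ten for property damage (0 means "no scaling")
data[,10] <- 0
data[,8] <- as.character(data[,8])  # PROPDMGEXP as character for pattern matching
data[grep("B|b", data$PROPDMGEXP), 10] <- 9  # billions
data[grep("M|m", data$PROPDMGEXP), 10] <- 6  # millions
data[grep("K|k", data$PROPDMGEXP), 10] <- 3  # thousands
data[,7] <- data$PROPDMG*(10^data[,10])  # PROPDMG in dollars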
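The same scaling rule can also be written as a lookup by name instead of by column position. The sketch below is only an illustration on invented toy data: the helper `exp_multiplier` and the column `PROPDMG_USD` are hypothetical names that do not appear in the analysis above. It assumes, as the analysis does, that any code other than B, M or K (in either case) leaves the amount unscaled.

```r
# Hypothetical helper (not part of the original analysis): translate an
# exponent code into a multiplier using a named lookup vector.
exp_multiplier <- function(code) {
  lookup <- c(K = 1e3, M = 1e6, B = 1e9)
  mult <- lookup[toupper(as.character(code))]  # NA for unknown or missing codes
  mult[is.na(mult)] <- 1                       # such amounts stay unscaled
  unname(mult)
}

# Toy data, invented for illustration only
toy <- data.frame(PROPDMG = c(25, 2.5, 10, 7),
                  PROPDMGEXP = c("K", "m", NA, "?"))
toy$PROPDMG_USD <- toy$PROPDMG * exp_multiplier(toy$PROPDMGEXP)
toy$PROPDMG_USD
```

Referring to columns by name rather than by index makes this variant less sensitive to changes in column order, at the cost of one extra helper function.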
Now, we can remove the columns used for adjustments:
data <- subset(data, select = c("BGN_DATE","EVTYPE", "INJURIES", "FATALITIES",
"CROPDMG", "PROPDMG"))
summary(data)
## BGN_DATE EVTYPE INJURIES
## 5/25/2011 0:00:00: 1202 HAIL :288661 Min. : 0.0000
## 4/27/2011 0:00:00: 1193 TSTM WIND :219940 1st Qu.: 0.0000
## 6/9/2011 0:00:00 : 1030 THUNDERSTORM WIND: 82563 Median : 0.0000
## 5/30/2004 0:00:00: 1016 TORNADO : 60652 Mean : 0.1557
## 4/4/2011 0:00:00 : 1009 FLASH FLOOD : 54277 3rd Qu.: 0.0000
## 4/2/2006 0:00:00 : 981 FLOOD : 25326 Max. :1700.0000
## (Other) :895866 (Other) :170878
## FATALITIES CROPDMG PROPDMG
## Min. : 0.0000 Min. :0.000e+00 Min. :0.000e+00
## 1st Qu.: 0.0000 1st Qu.:0.000e+00 1st Qu.:0.000e+00
## Median : 0.0000 Median :0.000e+00 Median :0.000e+00
## Mean : 0.0168 Mean :5.442e+04 Mean :4.736e+05
## 3rd Qu.: 0.0000 3rd Qu.:0.000e+00 3rd Qu.:5.000e+02
## Max. :583.0000 Max. :5.000e+09 Max. :1.150e+11
##
We are interested in comparing all types of recorded events and, according to the NOAA website, all event types have only been recorded since 1996. We therefore restrict the dataset to events that began in 1996 or later.
data1996 <- mutate(data, DATE = as.Date(as.character(BGN_DATE), "%m/%d/%Y")) %>%
filter(DATE > as.Date("1995-12-31"))
summary(data1996)
## BGN_DATE EVTYPE INJURIES
## 5/25/2011 0:00:00: 1202 HAIL :207715 Min. :0.00e+00
## 4/27/2011 0:00:00: 1193 TSTM WIND :128662 1st Qu.:0.00e+00
## 6/9/2011 0:00:00 : 1030 THUNDERSTORM WIND: 81402 Median :0.00e+00
## 5/30/2004 0:00:00: 1016 FLASH FLOOD : 50999 Mean :8.87e-02
## 4/4/2011 0:00:00 : 1009 FLOOD : 24247 3rd Qu.:0.00e+00
## 4/2/2006 0:00:00 : 981 TORNADO : 23154 Max. :1.15e+03
## (Other) :647099 (Other) :137351
## FATALITIES CROPDMG PROPDMG
## Min. : 0.00000 Min. :0.000e+00 Min. :0.000e+00
## 1st Qu.: 0.00000 1st Qu.:0.000e+00 1st Qu.:0.000e+00
## Median : 0.00000 Median :0.000e+00 Median :0.000e+00
## Mean : 0.01336 Mean :5.318e+04 Mean :5.612e+05
## 3rd Qu.: 0.00000 3rd Qu.:0.000e+00 3rd Qu.:1.250e+03
## Max. :158.00000 Max. :1.510e+09 Max. :1.150e+11
##
## DATE
## Min. :1996-01-01
## 1st Qu.:2000-11-21
## Median :2005-05-14
## Mean :2004-10-25
## 3rd Qu.:2008-08-22
## Max. :2011-11-30
##
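The date filter relies on `as.Date` reading only the month/day/year part of BGN_DATE and ignoring the trailing time text. A minimal sketch of that behaviour, using a few illustrative strings in the same format as BGN_DATE (the vectors `bgn`, `dates` and `keep` are names introduced here for the example only):

```r
# Illustrative strings in the BGN_DATE format; the "0:00:00" suffix is
# ignored because the format consumes only month/day/year.
bgn <- c("4/18/1950 0:00:00", "12/31/1995 0:00:00", "1/1/1996 0:00:00")
dates <- as.Date(bgn, format = "%m/%d/%Y")

# Same cut-off as DATE > as.Date("1995-12-31") in the filter above
keep <- dates >= as.Date("1996-01-01")
data.frame(bgn, dates, keep)
```

Only the last of the three strings passes the filter, since it is the only one dated 1996 or later.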
We have noticed that in the NOAA codebook some very similar types of events are separated into different groups (for example, “HEAT” and “EXCESSIVE HEAT”, or “TSTM WIND”, “THUNDERSTORM WIND” and “THUNDERSTORM WINDS”) and should probably be considered a single category. We have not combined them into a single category, because we want to be consistent with the NOAA codebook. However, we will consider the combined impact of these similar events at the final stage of our analysis.
Now we will determine which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health across the United States. The two variables in this dataset that measure harm to population health are FATALITIES and INJURIES.
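For readers who do want to look at the combined impact of such near-duplicate categories, one possible approach is to derive a separate grouping column and leave EVTYPE itself untouched. The sketch below runs on invented toy data. The helper `normalise_evtype`, the column `GROUP` and the mapping itself are assumptions made for illustration; the mapping covers only the variants named above and is far from exhaustive.

```r
# Hypothetical helper: collapse a few spelling variants of EVTYPE.
# The mapping is illustrative only and not taken from the NOAA codebook.
normalise_evtype <- function(x) {
  x <- toupper(trimws(as.character(x)))
  x <- sub("^TSTM WIND$", "THUNDERSTORM WIND", x)
  x <- sub("^THUNDERSTORM WINDS$", "THUNDERSTORM WIND", x)
  x <- sub("^EXCESSIVE HEAT$", "HEAT", x)
  x
}

# Toy data, invented for illustration only
toy <- data.frame(
  EVTYPE = c("TSTM WIND", "THUNDERSTORM WINDS", "HEAT", "EXCESSIVE HEAT", " Tornado"),
  FATALITIES = c(1, 2, 3, 4, 5)
)
toy$GROUP <- normalise_evtype(toy$EVTYPE)  # original EVTYPE is kept as-is
combined <- aggregate(FATALITIES ~ GROUP, data = toy, FUN = sum)
combined
```

Keeping the original column alongside the grouped one means the per-category results stay consistent with the codebook, while the grouped totals remain available for comparison.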
harm <- select(data1996, EVTYPE, FATALITIES, INJURIES) %>%
group_by(EVTYPE) %>%
summarise(fatalities = sum(FATALITIES), injuries = sum(INJURIES)) %>%
arrange(desc(fatalities+injuries))
harm <- harm[1:30,] # select top 30 events
harm_tidy <- gather(harm, harm_type, count, fatalities, injuries)
ggplot(harm_tidy, aes(x = reorder(EVTYPE, count), fill = harm_type)) +
geom_bar(aes(y=count), stat = "identity", position = "stack") +
xlab("") +
ylab("Number of fatalities and injuries") +
ggtitle("The weather events with highest impact on population health") +
theme(axis.text.x = element_text(colour="grey20",size=12),
axis.text.y = element_text(colour="grey20",size=12),
axis.title.x = element_text(size=14, vjust = -0.2),
axis.title.y = element_text(size=14),
title = element_text(size = 14, vjust = 1.5),
legend.text = element_text(size=14)) +
coord_flip()
Figure 1. The weather events with the highest impact on population health (fatalities and injuries) in the USA (years 1996-2011). The top 30 most harmful events are shown.
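A note on the bar order in the plot: `reorder(EVTYPE, count)` summarises the counts of each event type with the mean by default. Because every event type has exactly two rows in the tidy data (fatalities and injuries), ordering by the mean gives the same ranking as ordering by the total. A small sketch on invented toy data (the names `toy`, `ordered_types` and `totals` are introduced here for illustration):

```r
# Toy tidy data, invented for illustration: two rows per event type
toy <- data.frame(
  EVTYPE = rep(c("A", "B", "C"), times = 2),
  harm_type = rep(c("fatalities", "injuries"), each = 3),
  count = c(5, 1, 3, 10, 100, 2)
)

# reorder() uses FUN = mean by default; levels come out in increasing order
ordered_types <- reorder(factor(toy$EVTYPE), toy$count)
levels(ordered_types)

# Ranking by total harm, for comparison
totals <- tapply(toy$count, toy$EVTYPE, sum)
names(sort(totals))
```

Both rankings agree here because the mean is just the total divided by the same constant (two) for every event type.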
We can see from Figure 1 that the three most harmful kinds of events in the USA during the years 1996-2011 were tornadoes, heat, and floods.
Now we will determine which types of events have the greatest economic consequences. Two variables in this dataset that reflect economic impact are property damage (PROPDMG) and crop damage (CROPDMG).
damage <- select(data1996, EVTYPE, PROPDMG, CROPDMG) %>%
group_by(EVTYPE) %>%
summarise(property = sum(PROPDMG), crops = sum(CROPDMG)) %>%
arrange(desc(property+crops))
damage <- damage[1:30,] # select top 30 events
damage_tidy <- gather(damage, damage_type, dollars, property, crops)
ggplot(damage_tidy, aes(x = reorder(EVTYPE, dollars), fill = damage_type)) +
geom_bar(aes(y=dollars/1000000000), stat = "identity", position = "stack") +
xlab("") +
ylab("Economic damage, billions of US dollars") +
ggtitle("The weather events with highest impact on economy") +
theme(axis.text.x = element_text(colour="grey20",size=12),
axis.text.y = element_text(colour="grey20",size=12),
axis.title.x = element_text(size=14, vjust = -0.2),
axis.title.y = element_text(size=14),
title = element_text(size = 14, vjust = 1.5),
legend.text = element_text(size=14)) +
coord_flip()
Figure 2. The weather events with the highest impact on the economy (crop damage and property damage) in the USA (years 1996-2011). The top 30 most damaging events are shown.
We can see from Figure 2 that the three most economically damaging kinds of events in the USA during the years 1996-2011 were floods, hurricanes/typhoons, and storm surges.
We found that, during the years 1996-2011, tornadoes, heat, and floods were the most harmful events for public health. We also found that floods, hurricanes/typhoons, and storm surges had the largest negative economic impact.