Synopsis.

This project uses data from the U.S. National Oceanic and Atmospheric Administration’s (NOAA) Storm Database, collected in the USA over the years 1950-2011, which describes the effects of different weather events on human fatalities and injuries as well as the property and crop damage caused by these events [1, 2]. We performed a basic exploratory analysis of these data. In particular, the project attempted to answer the following questions: (i) which types of events were most harmful with respect to population health? (ii) which types of events had the greatest economic consequences?
We found that, in the US during the years 1996-2011, tornados, heat, and floods were the most harmful events for public health. We also found that floods, hurricanes, and storm surges had the biggest negative economic impact.

Data Processing.

The data were obtained from the NOAA Storm Database as a file compressed with bz2 [3].
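
The compressed file does not need to be unpacked by hand, because read.csv() opens a file path through file(), which detects bzip2 compression when reading. The following is a minimal sketch of that behaviour on a made-up two-row table; the temporary file and the toy values are assumptions for illustration and are not part of the Storm Database.

```r
# Toy round trip: write a small CSV through a bzip2 connection, then read it
# back with read.csv() given only the file path.
toy <- data.frame(EVTYPE = c("HAIL", "FLOOD"),
                  INJURIES = c(0, 2),
                  stringsAsFactors = FALSE)   # made-up values

path <- tempfile(fileext = ".csv.bz2")        # hypothetical temporary file
con <- bzfile(path, open = "w")
write.csv(toy, con, row.names = FALSE)
close(con)

# No explicit decompression step: the path is opened in read mode and the
# bzip2 compression is detected automatically.
back <- read.csv(path, stringsAsFactors = FALSE)
dim(back)
```

The same mechanism is what lets the raw data be read directly from the .csv.bz2 file in the next step.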

Loading the packages that we will use for the analysis:

library(dplyr)
library(tidyr)
library(ggplot2)

Reading raw data:

raw_data <- read.csv("storm_data.csv.bz2", na.strings = c("NA",""))
dim(raw_data)
## [1] 902297     37

There are 902297 rows and 37 columns in this dataset.
The dataset contains the following variables:

names(raw_data)
##  [1] "STATE__"    "BGN_DATE"   "BGN_TIME"   "TIME_ZONE"  "COUNTY"    
##  [6] "COUNTYNAME" "STATE"      "EVTYPE"     "BGN_RANGE"  "BGN_AZI"   
## [11] "BGN_LOCATI" "END_DATE"   "END_TIME"   "COUNTY_END" "COUNTYENDN"
## [16] "END_RANGE"  "END_AZI"    "END_LOCATI" "LENGTH"     "WIDTH"     
## [21] "F"          "MAG"        "FATALITIES" "INJURIES"   "PROPDMG"   
## [26] "PROPDMGEXP" "CROPDMG"    "CROPDMGEXP" "WFO"        "STATEOFFIC"
## [31] "ZONENAMES"  "LATITUDE"   "LONGITUDE"  "LATITUDE_E" "LONGITUDE_"
## [36] "REMARKS"    "REFNUM"

Subsetting the data to keep the variables required for our analysis, according to the NOAA codebook [2]:

data <- subset(raw_data, select = c("BGN_DATE","EVTYPE", "INJURIES", 
                                    "FATALITIES", "CROPDMG", "CROPDMGEXP",
                                    "PROPDMG", "PROPDMGEXP"))

Exploring the obtained dataset.

names(data)
## [1] "BGN_DATE"   "EVTYPE"     "INJURIES"   "FATALITIES" "CROPDMG"   
## [6] "CROPDMGEXP" "PROPDMG"    "PROPDMGEXP"
str(data)
## 'data.frame':    902297 obs. of  8 variables:
##  $ BGN_DATE  : Factor w/ 16335 levels "1/1/1966 0:00:00",..: 6523 6523 4242 11116 2224 2224 2260 383 3980 3980 ...
##  $ EVTYPE    : Factor w/ 985 levels "   HIGH SURF ADVISORY",..: 834 834 834 834 834 834 834 834 834 834 ...
##  $ INJURIES  : num  15 0 2 2 2 6 1 0 14 0 ...
##  $ FATALITIES: num  0 0 0 0 0 0 0 0 1 0 ...
##  $ CROPDMG   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CROPDMGEXP: Factor w/ 8 levels "?","0","2","B",..: NA NA NA NA NA NA NA NA NA NA ...
##  $ PROPDMG   : num  25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
##  $ PROPDMGEXP: Factor w/ 18 levels "-","?","+","0",..: 16 16 16 16 16 16 16 16 16 16 ...
summary(data)
##               BGN_DATE                    EVTYPE          INJURIES        
##  5/25/2011 0:00:00:  1202   HAIL             :288661   Min.   :   0.0000  
##  4/27/2011 0:00:00:  1193   TSTM WIND        :219940   1st Qu.:   0.0000  
##  6/9/2011 0:00:00 :  1030   THUNDERSTORM WIND: 82563   Median :   0.0000  
##  5/30/2004 0:00:00:  1016   TORNADO          : 60652   Mean   :   0.1557  
##  4/4/2011 0:00:00 :  1009   FLASH FLOOD      : 54277   3rd Qu.:   0.0000  
##  4/2/2006 0:00:00 :   981   FLOOD            : 25326   Max.   :1700.0000  
##  (Other)          :895866   (Other)          :170878                      
##    FATALITIES          CROPDMG          CROPDMGEXP        PROPDMG       
##  Min.   :  0.0000   Min.   :  0.000   K      :281832   Min.   :   0.00  
##  1st Qu.:  0.0000   1st Qu.:  0.000   M      :  1994   1st Qu.:   0.00  
##  Median :  0.0000   Median :  0.000   k      :    21   Median :   0.00  
##  Mean   :  0.0168   Mean   :  1.527   0      :    19   Mean   :  12.06  
##  3rd Qu.:  0.0000   3rd Qu.:  0.000   B      :     9   3rd Qu.:   0.50  
##  Max.   :583.0000   Max.   :990.000   (Other):     9   Max.   :5000.00  
##                                       NA's   :618413                    
##    PROPDMGEXP    
##  K      :424665  
##  M      : 11330  
##  0      :   216  
##  B      :    40  
##  5      :    28  
##  (Other):    84  
##  NA's   :465934
unique(data$CROPDMGEXP)
## [1] <NA> M    K    m    B    ?    0    k    2   
## Levels: ? 0 2 B k K m M
unique(data$PROPDMGEXP)
##  [1] K    M    <NA> B    m    +    0    5    6    ?    4    2    3    h   
## [15] 7    H    -    1    8   
## Levels: - ? + 0 1 2 3 4 5 6 7 8 B h H K m M

Converting the values of CROPDMG and PROPDMG into dollar amounts:

According to the NOAA codebook, the variables CROPDMGEXP and PROPDMGEXP scale the dollar amounts in the columns CROPDMG and PROPDMG, respectively, by billions (B), millions (M), or thousands (K). The CROPDMGEXP and PROPDMGEXP columns also contain some characters (such as -, ?, +, 0-9) that are not described in the codebook and are most probably data-entry mistakes. There are also many NAs in CROPDMGEXP and PROPDMGEXP; we take these to mean that the corresponding values of CROPDMG and PROPDMG do not need an adjustment. We will use only fields containing “B/b”, “M/m” or “K/k” when calculating the dollar amounts of CROPDMG and PROPDMG; all other values are left unscaled.

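# Column 9: power of ten for crop damage (0 means no adjustment)
data[,9] <- 0
data[,6] <- as.character(data[,6])   # CROPDMGEXP as character

data[grep("B|b", data$CROPDMGEXP), 9] <- 9   # billions
data[grep("M|m", data$CROPDMGEXP), 9] <- 6   # millions
data[grep("K|k", data$CROPDMGEXP), 9] <- 3   # thousands
data[,5] <- data$CROPDMG*(10^data[,9])       # CROPDMG in dollars

# Column 10: power of ten for property damage (0 means no adjustment)
data[,10] <- 0
data[,8] <- as.character(data[,8])   # PROPDMGEXP as character

data[grep("B|b", data$PROPDMGEXP), 10] <- 9  # billions
data[grep("M|m", data$PROPDMGEXP), 10] <- 6  # millions
data[grep("K|k", data$PROPDMGEXP), 10] <- 3  # thousands
data[,7] <- data$PROPDMG*(10^data[,10])      # PROPDMG in dollars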
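
The same conversion can also be sketched with a named lookup vector, which keeps the rule in one place. This is only an illustrative alternative, not the code used to produce the results; the names exp_power and to_dollars are hypothetical helpers introduced here.

```r
# Map exponent codes to powers of ten. Codes not listed here (NA, digits,
# "?", "+", "-", "h"/"H") fall back to 0, i.e. no adjustment, which mirrors
# the rule used in the code above.
exp_power <- c(K = 3, M = 6, B = 9)

to_dollars <- function(amount, code) {
  # Look up each code case-insensitively; unknown codes give NA
  power <- unname(exp_power[toupper(as.character(code))])
  power[is.na(power)] <- 0
  amount * 10^power
}

# Made-up amounts and codes for illustration
to_dollars(c(25, 2.5, 1, 7), c("K", "m", "B", NA))
```

With the real data, the call would be applied once to CROPDMG with CROPDMGEXP and once to PROPDMG with PROPDMGEXP.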

Now, we can remove the columns used for adjustments:

data <- subset(data, select = c("BGN_DATE","EVTYPE", "INJURIES", "FATALITIES", 
                                "CROPDMG",  "PROPDMG"))
summary(data)
##               BGN_DATE                    EVTYPE          INJURIES        
##  5/25/2011 0:00:00:  1202   HAIL             :288661   Min.   :   0.0000  
##  4/27/2011 0:00:00:  1193   TSTM WIND        :219940   1st Qu.:   0.0000  
##  6/9/2011 0:00:00 :  1030   THUNDERSTORM WIND: 82563   Median :   0.0000  
##  5/30/2004 0:00:00:  1016   TORNADO          : 60652   Mean   :   0.1557  
##  4/4/2011 0:00:00 :  1009   FLASH FLOOD      : 54277   3rd Qu.:   0.0000  
##  4/2/2006 0:00:00 :   981   FLOOD            : 25326   Max.   :1700.0000  
##  (Other)          :895866   (Other)          :170878                      
##    FATALITIES          CROPDMG             PROPDMG         
##  Min.   :  0.0000   Min.   :0.000e+00   Min.   :0.000e+00  
##  1st Qu.:  0.0000   1st Qu.:0.000e+00   1st Qu.:0.000e+00  
##  Median :  0.0000   Median :0.000e+00   Median :0.000e+00  
##  Mean   :  0.0168   Mean   :5.442e+04   Mean   :4.736e+05  
##  3rd Qu.:  0.0000   3rd Qu.:0.000e+00   3rd Qu.:5.000e+02  
##  Max.   :583.0000   Max.   :5.000e+09   Max.   :1.150e+11  
## 

Since we are interested in comparing data across all types of recorded events and, according to the NOAA website, all event types have been recorded only since 1996, we restrict the dataset to events from 1996 onward.
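
The date filter relies on as.Date() parsing only the date part of BGN_DATE: with the format "%m/%d/%Y", the trailing time string is ignored. A small sketch on three made-up date strings (the vector names bgn, dates and keep are illustrative only):

```r
# BGN_DATE values look like "5/25/2011 0:00:00". Parsing stops once the
# format "%m/%d/%Y" has been matched, so the time part is dropped.
bgn <- c("12/31/1995 0:00:00", "1/1/1996 0:00:00", "5/25/2011 0:00:00")
dates <- as.Date(bgn, format = "%m/%d/%Y")

# Keep events from 1 January 1996 onward
keep <- dates >= as.Date("1996-01-01")
bgn[keep]
```

The comparison `dates >= as.Date("1996-01-01")` selects the same rows as `DATE > as.Date("1995-12-31")`.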

data1996 <- mutate(data, DATE = as.Date(as.character(BGN_DATE), "%m/%d/%Y")) %>%
            filter(DATE > as.Date("1995-12-31"))

summary(data1996)
##               BGN_DATE                    EVTYPE          INJURIES       
##  5/25/2011 0:00:00:  1202   HAIL             :207715   Min.   :0.00e+00  
##  4/27/2011 0:00:00:  1193   TSTM WIND        :128662   1st Qu.:0.00e+00  
##  6/9/2011 0:00:00 :  1030   THUNDERSTORM WIND: 81402   Median :0.00e+00  
##  5/30/2004 0:00:00:  1016   FLASH FLOOD      : 50999   Mean   :8.87e-02  
##  4/4/2011 0:00:00 :  1009   FLOOD            : 24247   3rd Qu.:0.00e+00  
##  4/2/2006 0:00:00 :   981   TORNADO          : 23154   Max.   :1.15e+03  
##  (Other)          :647099   (Other)          :137351                     
##    FATALITIES           CROPDMG             PROPDMG         
##  Min.   :  0.00000   Min.   :0.000e+00   Min.   :0.000e+00  
##  1st Qu.:  0.00000   1st Qu.:0.000e+00   1st Qu.:0.000e+00  
##  Median :  0.00000   Median :0.000e+00   Median :0.000e+00  
##  Mean   :  0.01336   Mean   :5.318e+04   Mean   :5.612e+05  
##  3rd Qu.:  0.00000   3rd Qu.:0.000e+00   3rd Qu.:1.250e+03  
##  Max.   :158.00000   Max.   :1.510e+09   Max.   :1.150e+11  
##                                                             
##       DATE           
##  Min.   :1996-01-01  
##  1st Qu.:2000-11-21  
##  Median :2005-05-14  
##  Mean   :2004-10-25  
##  3rd Qu.:2008-08-22  
##  Max.   :2011-11-30  
## 

Results.

We have noticed that in the NOAA codebook some very similar types of events are separated into different groups (for example, “HEAT” and “EXCESSIVE HEAT”, or “TSTM WIND”, “THUNDERSTORM WIND” and “THUNDERSTORM WINDS”) and should probably be considered as a single category. We have not combined them into a single category, because we want to be consistent with the NOAA codebook. However, we will consider the combined impact of these similar events at the final stage of our analysis.
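
For readers who do want combined categories, one possible approach is to relabel the similar event types before summing. The sketch below is an assumption about how this could be done, not a step of the analysis reported here; combine_evtype is a hypothetical helper, the mapping covers only the examples named above, and the counts are made up.

```r
# Hypothetical relabelling of similar EVTYPE values into one group each
combine_evtype <- function(evtype) {
  x <- toupper(trimws(as.character(evtype)))
  x[x %in% c("TSTM WIND", "THUNDERSTORM WIND", "THUNDERSTORM WINDS")] <- "THUNDERSTORM WIND"
  x[x %in% c("HEAT", "EXCESSIVE HEAT")] <- "HEAT"
  x
}

# Made-up counts for illustration only
toy <- data.frame(
  EVTYPE = c("HEAT", "EXCESSIVE HEAT", "TSTM WIND", "THUNDERSTORM WINDS", "TORNADO"),
  FATALITIES = c(1, 2, 3, 4, 5)
)

# Sum fatalities per combined group
totals <- tapply(toy$FATALITIES, combine_evtype(toy$EVTYPE), sum)
totals
```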

1. Impact on Population Health.

Now we will determine which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health across the United States. The two variables in this dataset that measure harm to population health are FATALITIES and INJURIES.

Subsetting and summarizing data:

harm <- select(data1996, EVTYPE, FATALITIES, INJURIES) %>%
                group_by(EVTYPE) %>%
                    summarise(fatalities = sum(FATALITIES), injuries = sum(INJURIES)) %>%
                        arrange(desc(fatalities+injuries))

harm <- harm[1:30,] # select top 30 events
harm_tidy <- gather(harm, harm_type, count, fatalities, injuries)
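
To make the summarise, arrange and gather steps concrete, here is a toy illustration in base R of what they compute: totals per event type, sorted by combined harm, then stacked into one row per event type and harm type. The event names and counts are made up, and base R is used only so that the sketch runs without extra packages.

```r
# Made-up events for illustration only
toy <- data.frame(EVTYPE = c("A", "B", "A", "C"),
                  FATALITIES = c(1, 0, 2, 5),
                  INJURIES = c(10, 3, 0, 1))

# Totals per event type (the group_by + summarise step)
totals <- aggregate(toy[c("FATALITIES", "INJURIES")],
                    by = list(EVTYPE = toy$EVTYPE), FUN = sum)

# Sort by combined harm, largest first (the arrange step)
totals <- totals[order(-(totals$FATALITIES + totals$INJURIES)), ]

# Wide to long: one row per event type and harm type (the gather step)
long <- data.frame(
  EVTYPE = rep(totals$EVTYPE, times = 2),
  harm_type = rep(c("fatalities", "injuries"), each = nrow(totals)),
  count = c(totals$FATALITIES, totals$INJURIES)
)
long
```

The long layout is what lets ggplot2 map harm_type to the fill colour and stack the two counts in one bar per event type.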

Plotting the top 30 events:

ggplot(harm_tidy, aes(x = reorder(EVTYPE, count), fill = harm_type)) + 
        geom_bar(aes(y=count), stat = "identity", position = "stack") +
        xlab("") +
        ylab("Number of fatalities and injuries") +
        ggtitle("The weather events with highest impact on population health") +
        theme(axis.text.x = element_text(colour="grey20",size=12),
              axis.text.y = element_text(colour="grey20",size=12),
              axis.title.x = element_text(size=14, vjust = -0.2),
              axis.title.y = element_text(size=14),
              title = element_text(size = 14, vjust = 1.5),
              legend.text = element_text(size=14)) +   
        coord_flip()

Figure 1. The weather events which have highest impact on population health (fatalities and injuries) in USA (years 1996-2011). Top 30 most harmful events are shown.

We can see from Figure 1 that the three most harmful types of events in the USA during the years 1996-2011 were:

  1. Tornados
  2. Heat (Excessive Heat + Heat)
  3. Floods (Flood + Flash Flood)

2. Impact on Economics.

Now we will determine which types of events have the greatest economic consequences. Two variables in this dataset that reflect economic impact are property damage (PROPDMG) and crop damage (CROPDMG).

Subsetting and summarizing data:

damage <- select(data1996, EVTYPE, PROPDMG, CROPDMG) %>%
                group_by(EVTYPE) %>%
                    summarise(property = sum(PROPDMG), crops = sum(CROPDMG)) %>%
                        arrange(desc(property+crops))

damage <- damage[1:30,] # select top 30 events
damage_tidy <- gather(damage, damage_type, dollars, property, crops)

Plotting the top 30 events:

ggplot(damage_tidy, aes(x = reorder(EVTYPE, dollars), fill = damage_type)) + 
        geom_bar(aes(y=dollars/1000000000), stat = "identity", position = "stack") +
        xlab("") +
        ylab("Economic damage, billions of US dollars") +
        ggtitle("The weather events with highest impact on economy") +
        theme(axis.text.x = element_text(colour="grey20",size=12),
              axis.text.y = element_text(colour="grey20",size=12),
              axis.title.x = element_text(size=14, vjust = -0.2),
              axis.title.y = element_text(size=14),
              title = element_text(size = 14, vjust = 1.5),
              legend.text = element_text(size=14)) +   
        coord_flip()

Figure 2. The weather events which have highest impact on economy (crop damage and property damage) in USA (years 1996-2011). Top 30 most harmful events are shown.

We can see from Figure 2 that the three most economically damaging types of events in the USA during the years 1996-2011 were:

  1. Floods
  2. Hurricanes/Typhoons
  3. Storm Surges

Conclusion.

We found that, during the years 1996-2011, tornados, heat, and floods were the most harmful events for public health. We also found that floods, hurricanes/typhoons, and storm surges had the biggest negative economic impact.