1. Synopsis

This project is part of the Reproducible Research Course by Coursera and explores the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. This database tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage.

Storms and other severe weather events can cause both public health and economic problems for communities and municipalities. Many severe events can result in fatalities, injuries, and property damage. The goal of this analysis is to find which events are most harmful with respect to population health and which events have the greatest economic consequences.

2. Data Processing

The data for this assignment come in the form of a comma-separated-value file compressed via the bzip2 algorithm to reduce its size. The events in the database start in the year 1950 and end in November 2011. The file can be downloaded from the URL given in Section 2.1.

2.1 Data Loading

First we load the following packages, which are used to handle the data and plot the results.

library(dplyr)
library(ggplot2)
library(gridExtra)

Then we download the file and read it as comma-separated values.

url <- 'https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2'
download.file(url,'storm.csv.bz2')
storm_data <- read.csv('storm.csv.bz2')
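Because the compressed file is large, it can help to skip the download when a local copy already exists. The following is a minimal sketch of that idea, not part of the original analysis; the helper name fetch_once is hypothetical.

```r
# Hypothetical helper: download `url` to `dest` only when `dest` is missing.
fetch_once <- function(url, dest) {
  if (!file.exists(dest)) {
    # mode = "wb" keeps the compressed file byte-for-byte intact on all platforms
    download.file(url, dest, mode = "wb")
  }
  invisible(dest)
}

# Usage (commented out so the sketch runs without network access):
# fetch_once('https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2',
#            'storm.csv.bz2')
# storm_data <- read.csv('storm.csv.bz2')
```

read.csv can read the bzip2-compressed file directly, so no separate decompression step is needed.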

Most of the information contained in the original data is not necessary for our analysis. For this reason, we create a subset containing only the columns that we need.

storm <- select(storm_data,EVTYPE,FATALITIES,INJURIES,PROPDMG,PROPDMGEXP,CROPDMG,CROPDMGEXP)
storm$PROPDMGEXP <- as.character(storm$PROPDMGEXP)
storm$CROPDMGEXP <- as.character(storm$CROPDMGEXP)
head(storm)
##    EVTYPE FATALITIES INJURIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP
## 1 TORNADO          0       15    25.0          K       0           
## 2 TORNADO          0        0     2.5          K       0           
## 3 TORNADO          0        2    25.0          K       0           
## 4 TORNADO          0        2     2.5          K       0           
## 5 TORNADO          0        2     2.5          K       0           
## 6 TORNADO          0        6     2.5          K       0

2.2 Property and Crop Damage Value

The Storm Data Documentation states that damage values are rounded to three significant digits, followed by an alphabetical character signifying the magnitude of the number. For example, alphabetical characters used to signify magnitude include “K” for thousands, “M” for millions, and “B” for billions.
The magnitude codes in the columns PROPDMGEXP and CROPDMGEXP are converted into powers of ten; for example, M is converted into 6. The damage value is then computed as the damage figure multiplied by 10 raised to that power, and stored in the new columns PROP_VALUE and CROP_VALUE.

unique(storm$PROPDMGEXP)
##  [1] "K" "M" ""  "B" "m" "+" "0" "5" "6" "?" "4" "2" "3" "h" "7" "H" "-"
## [18] "1" "8"

The letter codes are replaced with numeric exponents. The symbols '-', '+', '?' and the empty string become 0, and digits already present are kept as exponents. The property value is then calculated and stored in the new column PROP_VALUE.

storm$PROPDMGEXP[storm$PROPDMGEXP %in% c('-','+','?','')] <- 0
storm$PROPDMGEXP[storm$PROPDMGEXP %in% c('h','H')] <- 2
storm$PROPDMGEXP[storm$PROPDMGEXP %in% c('m','M')] <- 6
storm$PROPDMGEXP[storm$PROPDMGEXP %in% c('k','K')] <- 3
storm$PROPDMGEXP[storm$PROPDMGEXP %in% c('b','B')] <- 9

storm$PROP_VALUE <- storm$PROPDMG * 10^as.numeric(storm$PROPDMGEXP)
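The conversion can also be expressed with a lookup vector, which leaves the original column untouched. This is a minimal sketch under the same assumptions as the code above (letters map to 2, 3, 6 and 9, digits are used as exponents, anything else counts as 0); the names exp_lookup and damage_value are hypothetical and do not appear in the original analysis.

```r
# Hypothetical lookup from magnitude letter to power of ten.
exp_lookup <- c("H" = 2, "K" = 3, "M" = 6, "B" = 9)

# Hypothetical helper: convert damage figures and their codes to damage values.
damage_value <- function(dmg, exp_code) {
  code <- toupper(exp_code)
  # Letter codes come from the lookup; digit codes are used as they are.
  power <- ifelse(code %in% names(exp_lookup),
                  exp_lookup[code],
                  suppressWarnings(as.numeric(code)))
  # Anything else ('', '-', '+', '?') falls back to an exponent of 0.
  power[is.na(power)] <- 0
  dmg * 10^power
}

# Worked example: 25 with code K is 25 * 10^3 = 25000.
damage_value(c(25, 2.5, 3), c("K", "m", ""))
```

With this helper, the first row of the data shown earlier (PROPDMG of 25.0 with code K) corresponds to a property value of 25000.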

The same operation is conducted for Crop Damage.

unique(storm$CROPDMGEXP)
## [1] ""  "M" "K" "m" "B" "?" "0" "k" "2"

The letter codes are again replaced with numeric exponents. The crop value is then calculated and stored in the new column CROP_VALUE.

storm$CROPDMGEXP[storm$CROPDMGEXP %in% c('?','')] <- 0
storm$CROPDMGEXP[storm$CROPDMGEXP %in% c('m','M')] <- 6
storm$CROPDMGEXP[storm$CROPDMGEXP %in% c('k','K')] <- 3
storm$CROPDMGEXP[storm$CROPDMGEXP %in% c('b','B')] <- 9

storm$CROP_VALUE <- storm$CROPDMG * 10^as.numeric(storm$CROPDMGEXP)

3. Results Analysis

With the data processed and cleaned, we can analyse the weather events.

3.1 Population Health

First we analyse which events are most harmful with respect to population health, measured by the number of fatalities and injuries that each type of event caused.

The data is grouped by type of event and summarised by totalling the fatalities for each type. The resulting table is sorted in descending order so that the deadliest event types appear first. The output shows that tornadoes cause the most fatalities.

fatalities <- storm %>% group_by(EVTYPE) %>%
                        summarise(fat=sum(FATALITIES)) %>%
                        arrange(desc(fat))
head(fatalities)
## # A tibble: 6 × 2
##           EVTYPE   fat
##           <fctr> <dbl>
## 1        TORNADO  5633
## 2 EXCESSIVE HEAT  1903
## 3    FLASH FLOOD   978
## 4           HEAT   937
## 5      LIGHTNING   816
## 6      TSTM WIND   504

The same process is repeated for injuries: the data is grouped by type of event, the injuries are totalled for each type, and the table is sorted in descending order. The output shows that tornadoes also cause the most injuries.

injuries <- storm %>% group_by(EVTYPE) %>%
                        summarise(inj=sum(INJURIES)) %>%
                        arrange(desc(inj))
head(injuries)
## # A tibble: 6 × 2
##           EVTYPE   inj
##           <fctr> <dbl>
## 1        TORNADO 91346
## 2      TSTM WIND  6957
## 3          FLOOD  6789
## 4 EXCESSIVE HEAT  6525
## 5      LIGHTNING  5230
## 6           HEAT  2100
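The two summaries can also be computed in a single pass. The following is a minimal sketch on a toy data frame whose numbers are invented for illustration; the names toy and health are hypothetical and are not part of the original analysis.

```r
library(dplyr)

# Toy data standing in for `storm`; the values are invented for illustration.
toy <- data.frame(
  EVTYPE     = c("TORNADO", "HEAT", "TORNADO", "FLOOD", "HEAT"),
  FATALITIES = c(3, 2, 1, 0, 4),
  INJURIES   = c(10, 0, 5, 2, 1)
)

# Total fatalities and injuries per event type, sorted by fatalities.
health <- toy %>%
  group_by(EVTYPE) %>%
  summarise(fat = sum(FATALITIES), inj = sum(INJURIES)) %>%
  arrange(desc(fat))

health
```

Keeping both totals in one table makes it easy to see event types that rank differently on the two measures.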

These results can be plotted to make the comparison easier. We plot the five event types with the most fatalities and the five with the most injuries. The plots make clear how far tornadoes exceed every other event type on both measures.

g1 <- ggplot(data = fatalities[1:5,],aes(reorder(EVTYPE,fat),fat))+
    geom_bar(stat = 'identity',fill='blue')+coord_flip()+
    ggtitle('Events with high Fatalities')+ylab('Fatalities')+xlab('Events')+
    geom_text(aes(label=fat),hjust=1.2,col='white')
g2 <- ggplot(data=injuries[1:5,],aes(reorder(EVTYPE,inj),inj))+
    geom_bar(stat = 'identity',fill='red')+coord_flip()+
    ggtitle('Events with high Injuries')+ylab('Injuries')+xlab('Events')+
    geom_text(aes(label=inj),hjust=1,col='white')
grid.arrange(g1,g2,ncol=1)

3.2 Economic Consequences

The second analysis concerns the economic consequences of the weather events, that is, which events cause the greatest economic damage. To answer this we consider two variables from our data set: the value of damaged property and the value of damaged crops.

Starting with property, the data set storm is grouped by type of event and summarised by totalling the property damage for each type. The resulting table is sorted in descending order so that the most damaging event types appear first. The results show that floods cause the greatest property damage.

properties <- storm %>% group_by(EVTYPE) %>%
                        summarise(prop=sum(PROP_VALUE)) %>%
                        arrange(desc(prop))
head(properties)
## # A tibble: 6 × 2
##              EVTYPE         prop
##              <fctr>        <dbl>
## 1             FLOOD 144657709807
## 2 HURRICANE/TYPHOON  69305840000
## 3           TORNADO  56947380677
## 4       STORM SURGE  43323536000
## 5       FLASH FLOOD  16822673979
## 6              HAIL  15735267513
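Totals of this size are easier to read in billions. The following is a minimal sketch of that rescaling, using the three largest totals copied from the table above and assuming the damage figures are in US dollars; the names prop_totals and prop_billions are hypothetical.

```r
# Totals copied from the property-damage table above (assumed to be US dollars).
prop_totals <- c(
  "FLOOD"             = 144657709807,
  "HURRICANE/TYPHOON" = 69305840000,
  "TORNADO"           = 56947380677
)

# Rescale to billions and round to one decimal place.
prop_billions <- round(prop_totals / 1e9, 1)
prop_billions
```

On this scale, floods account for roughly 144.7 billion in property damage, about twice the total for hurricanes and typhoons.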

The same process is repeated for crops: the data is grouped by type of event, the crop damage is totalled for each type, and the table is sorted in descending order. The results show that drought causes the greatest crop damage.

crops <- storm %>% group_by(EVTYPE) %>%
                        summarise(crop=sum(CROP_VALUE)) %>%
                        arrange(desc(crop))
head(crops)
## # A tibble: 6 × 2
##        EVTYPE        crop
##        <fctr>       <dbl>
## 1     DROUGHT 13972566000
## 2       FLOOD  5661968450
## 3 RIVER FLOOD  5029459000
## 4   ICE STORM  5022113500
## 5        HAIL  3025954473
## 6   HURRICANE  2741910000

The results are plotted to compare the five most damaging event types in each category.

p1 <- ggplot(data=properties[1:5,],aes(reorder(EVTYPE,prop),prop))+
    geom_bar(stat='identity',fill='blue')+
    coord_flip()+ggtitle('Events with high Properties Damages')+
    xlab('Events')+ylab('Property Damages')

p2 <- ggplot(data=crops[1:5,],aes(reorder(EVTYPE,crop),crop))+
    geom_bar(stat='identity',fill='red')+
    coord_flip()+ggtitle('Events with high Crops Damages')+
    xlab('Events')+ylab('Crops Damages')

grid.arrange(p1,p2,ncol=1)

4. Conclusion

The analysis conducted can be summarised as follows:

  • With regard to population health, Tornado is the event causing the highest number of fatalities and injuries. It is followed by Excessive Heat for fatalities and Thunderstorm Wind for injuries.

  • With regard to economic consequences, Flood is the event causing the highest property damage and Drought is the event causing the highest crop damage.