1 Synopsis

The goal of this assignment is to explore the NOAA Storm Database and assess the effects of severe weather events on both population health and the economy. The database covers the period from 1950 to November 2011.

The analysis addresses the following questions:

  1. Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?
  2. Across the United States, which types of events have the greatest economic consequences?

Information on the Data: Documentation

2 Data Processing

It is assumed that the raw data file is available in the working directory.

2.1 Read Raw Data

The dplyr package is used for reading and manipulating the data; lubridate, reshape2 and ggplot2 are loaded for date handling, reshaping and plotting.

While the data are read, the following transformations are performed:

  • BGN_DATE and END_DATE are cast to date format
  • BGN_YEAR is derived from BGN_DATE
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(lubridate)
## 
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union
library(reshape2)
library(ggplot2)

# Read the raw file with explicit column classes, then parse the dates
sdb <-
    read.csv("repdata_data_StormData.csv",sep=",",na.strings=NULL,colClasses =c("numeric","character","character","character","numeric","character","character","character","numeric","character","character","character","character","numeric","character","numeric","character","character","numeric","numeric","character","numeric","numeric","numeric","numeric","character","numeric","character","character","character","character","numeric","numeric","numeric","numeric","character","numeric")) %>%
    mutate(BGN_DATE=as.Date(BGN_DATE,format="%m/%d/%Y"),END_DATE=as.Date(END_DATE,format="%m/%d/%Y"),BGN_YEAR=year(BGN_DATE))
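
The date conversion can be checked on a single value. The following is a minimal sketch in base R, assuming the raw BGN_DATE strings carry a trailing time component such as "0:00:00" (an assumption about the file layout, not confirmed above). The sample value is invented for illustration, and format() stands in for lubridate's year().

```r
# Hypothetical raw value; the trailing time part is an assumed layout
raw_date <- "4/18/1950 0:00:00"

# as.Date() parses the leading month/day/year; characters after the
# part matched by the format are ignored
bgn_date <- as.Date(raw_date, format = "%m/%d/%Y")

# Base R equivalent of lubridate::year()
bgn_year <- as.integer(format(bgn_date, "%Y"))

print(bgn_date)
print(bgn_year)
```

If the raw strings used a different layout, the format string would need to change accordingly; a mismatch yields NA rather than an error.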

2.2 Check data quality over Time

In the earlier years of the database there are generally fewer events recorded, most likely due to a lack of good records. More recent years should be considered more complete.

The number of available events over time is explored using a simple bar plot:

# Count the recorded events per year
sdb_cnt<-
    sdb   %>%
    group_by(BGN_YEAR) %>%
    summarise(N_EVENTS=n(),.groups="drop")

plot(sdb_cnt, type="h", col="red", xlab="Years", ylab="Number of cases", main="Number of cases over Time")

Based on these results, the years 1989-2011 are selected for further analysis.
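
One way to make the choice of a cutoff year less subjective is to look at the share of all records that fall in or after a candidate year. The sketch below uses invented yearly counts, not the actual NOAA figures; the names yearly and share_from are hypothetical and not part of the analysis above.

```r
# Invented yearly counts for illustration only (not NOAA figures)
yearly <- data.frame(
  BGN_YEAR = c(1986, 1987, 1988, 1989, 1990),
  N_EVENTS = c(100, 100, 200, 800, 800)
)

# Share of all records recorded in or after a candidate cutoff year
share_from <- function(counts, cutoff) {
  sum(counts$N_EVENTS[counts$BGN_YEAR >= cutoff]) / sum(counts$N_EVENTS)
}

print(share_from(yearly, 1989))
```

Applied to the real per-year counts, a helper like this would show how much of the database is retained by a given cutoff.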

2.3 Create tidy dataset

Property and crop damage values are stored with the base and the exponent in separate columns. However, the exponent notation is mixed: it may contain digits such as “2”, which translates to 10^2, or letters such as “M”, which translates to 10^6.

unique(sdb[,"PROPDMGEXP"])
##  [1] "K" "M" ""  "B" "m" "+" "0" "5" "6" "?" "4" "2" "3" "h" "7" "H" "-" "1" "8"
unique(sdb[,"CROPDMGEXP"])
## [1] ""  "M" "K" "m" "B" "?" "0" "k" "2"

In the next steps, a tidy dataset is created:

  • only years from 1989 are considered
  • PROPDMG_VAL and CROPDM_VAL are created
sdb_tidy <-
    sdb  %>%
  filter(BGN_YEAR >= 1989) %>%
    mutate(PROPDMG_VAL=PROPDMG*
                                case_when(
                                    toupper(PROPDMGEXP) == "1" ~ 10^1, 
                                    toupper(PROPDMGEXP) == "2" ~ 10^2, 
                                    toupper(PROPDMGEXP) == "3" ~ 10^3, 
                                    toupper(PROPDMGEXP) == "4" ~ 10^4, 
                                    toupper(PROPDMGEXP) == "5" ~ 10^5, 
                                    toupper(PROPDMGEXP) == "6" ~ 10^6, 
                                    toupper(PROPDMGEXP) == "7" ~ 10^7, 
                                    toupper(PROPDMGEXP) == "8" ~ 10^8,
                                    toupper(PROPDMGEXP) == "9" ~ 10^9, 
                                    toupper(PROPDMGEXP) == "H" ~ 10^2, 
                                    toupper(PROPDMGEXP) == "K" ~ 10^3, 
                                    toupper(PROPDMGEXP) == "M" ~ 10^6, 
                                    toupper(PROPDMGEXP) == "B" ~ 10^9, 
                                    .default = 10^0
                                ),
                CROPDM_VAL=CROPDMG*
                                case_when(
                                    toupper(CROPDMGEXP) == "2" ~ 10^2,
                                    toupper(CROPDMGEXP) == "K" ~ 10^3,
                                    toupper(CROPDMGEXP) == "M" ~ 10^6,
                                    toupper(CROPDMGEXP) == "B" ~ 10^9,
                                    .default = 10^0
                                )
                )
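
The same mapping can also be expressed as a lookup vector, which makes it easy to test individual codes. This is a sketch in base R that mirrors the PROPDMGEXP branches above; exp_lookup and exp_multiplier are hypothetical names introduced only for illustration, and unknown codes such as "", "?", "+" or "-" fall back to a multiplier of 1, as in the .default branch.

```r
# Multipliers mirroring the case_when() branches used for PROPDMGEXP
exp_lookup <- c(
  "1" = 1e1, "2" = 1e2, "3" = 1e3, "4" = 1e4, "5" = 1e5,
  "6" = 1e6, "7" = 1e7, "8" = 1e8, "9" = 1e9,
  "H" = 1e2, "K" = 1e3, "M" = 1e6, "B" = 1e9
)

# Hypothetical helper: map exponent codes to multipliers, defaulting to 1
exp_multiplier <- function(code) {
  mult <- unname(exp_lookup[toupper(code)])
  mult[is.na(mult)] <- 1   # codes without an entry keep the base value
  mult
}

# A base value of 2.5 with exponent "K" corresponds to 2500
print(2.5 * exp_multiplier("K"))
print(exp_multiplier(c("K", "m", "", "?", "5")))
```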

3 Results

In the last step, two further datasets are created for the graphical analysis:

  1. sdb_health for the analysis of the most harmful events with respect to population health
  2. sdb_eco for the analysis of the events that have the greatest economic consequences

3.1 Events that are Most Harmful to Population Health

The following data transformations were performed:

  • only records with either FATALITIES > 0 or INJURIES > 0 are considered
  • a new column, TOTAL_VAL, is created as the sum of FATALITIES and INJURIES
  • the sum of TOTAL_VAL is obtained for each event type; this is the basis for selecting the top 10
  • the dataset is unpivoted into long format, as required by ggplot2’s geom_bar() for stacked bars

sdb_health<-
    sdb_tidy %>%
    filter(FATALITIES > 0|INJURIES > 0) %>%
    mutate(TOTAL_VAL=FATALITIES+INJURIES,EVTYPE=factor(EVTYPE)) %>%
    select(EVTYPE,FATALITIES,INJURIES,TOTAL_VAL) %>%
    group_by(EVTYPE) %>%
    summarise(across(where(is.numeric),sum),.groups="keep") %>% 
    ungroup()  %>%
    slice_max(TOTAL_VAL,n = 10)  %>%
    melt(id.vars="EVTYPE", variable.name="CASUALTY_TYPE", value.name="CASUALTY_VALUE")  %>%
    filter(CASUALTY_TYPE!="TOTAL_VAL")
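
To make the shape of the result concrete, the same steps (aggregate, rank by total, unpivot) are sketched below in base R on a few invented records. This is not the pipeline used above, only an illustration of what it produces; note that head() cuts off ties, whereas slice_max() keeps them by default.

```r
# Invented records for illustration only
events <- data.frame(
  EVTYPE     = c("TORNADO", "FLOOD", "TORNADO", "HEAT", "FLOOD"),
  FATALITIES = c(2, 1, 3, 4, 0),
  INJURIES   = c(10, 0, 20, 1, 2)
)

# Sum both casualty columns per event type
per_type <- aggregate(cbind(FATALITIES, INJURIES) ~ EVTYPE,
                      data = events, FUN = sum)
per_type$TOTAL_VAL <- per_type$FATALITIES + per_type$INJURIES

# Keep the two event types with the largest totals
top <- head(per_type[order(-per_type$TOTAL_VAL), ], 2)

# Unpivot: one row per event type and casualty type
long <- data.frame(
  EVTYPE         = rep(as.character(top$EVTYPE), times = 2),
  CASUALTY_TYPE  = rep(c("FATALITIES", "INJURIES"), each = nrow(top)),
  CASUALTY_VALUE = c(top$FATALITIES, top$INJURIES)
)
print(long)
```

Each event type appears once per casualty type, which is the layout a stacked bar chart needs.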

The following graph shows the 10 event types that are most harmful to population health:

plot_health <- ggplot(sdb_health) +
  geom_bar(aes(x = reorder(EVTYPE, CASUALTY_VALUE), y = CASUALTY_VALUE, fill=CASUALTY_TYPE), 
           position = "stack", stat = "identity") +
    coord_flip() +
    labs(x = "Event Type",y = "Casualty Cases", title="Top 10 most harmful events (Health)")
print(plot_health)
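
A note on the bar ordering: reorder() ranks the levels by the mean of the values per group unless another function is supplied. In the plot above every event type contributes exactly two rows, so ordering by mean and ordering by total agree. The sketch below, on invented values, shows how the two can differ when groups have unequal row counts, in which case passing FUN = sum would be the safer choice.

```r
# Invented long-format values: "A" has two rows, "B" only one
evtype <- factor(c("A", "A", "B"))
value  <- c(1, 100, 60)

# Default: levels ordered by the mean per group (A = 50.5, B = 60)
by_mean <- levels(reorder(evtype, value))

# Explicit: levels ordered by the sum per group (A = 101, B = 60)
by_sum <- levels(reorder(evtype, value, FUN = sum))

print(by_mean)
print(by_sum)
```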

3.2 Events that have the Greatest Economic Consequences

The following data transformations were performed:

  • only records with either PROPDMG_VAL > 0 or CROPDM_VAL > 0 are considered
  • a new column, TOTAL_VAL, is created as the sum of PROPDMG_VAL and CROPDM_VAL
  • the sum of TOTAL_VAL is obtained for each event type; this is the basis for selecting the top 10
  • the dataset is unpivoted into long format, as required by ggplot2’s geom_bar() for stacked bars

sdb_eco<-
    sdb_tidy %>%
    filter(PROPDMG_VAL > 0|CROPDM_VAL > 0) %>%
    mutate(TOTAL_VAL=PROPDMG_VAL+CROPDM_VAL) %>%
    select(EVTYPE,PROPDMG_VAL,CROPDM_VAL,TOTAL_VAL) %>%
    group_by(EVTYPE) %>%
    summarise(across(where(is.numeric),sum),.groups="keep") %>% 
    ungroup()  %>%
    slice_max(TOTAL_VAL,n = 10)  %>%
    melt(id.vars="EVTYPE", variable.name="DAMAGE_TYPE", value.name="DAMAGE_VALUE")  %>%
    filter(DAMAGE_TYPE!="TOTAL_VAL")

The following graph shows the 10 event types that have the greatest economic consequences:

plot_eco <- ggplot(sdb_eco) +
  geom_bar(aes(x = reorder(EVTYPE, DAMAGE_VALUE), y = DAMAGE_VALUE, fill=DAMAGE_TYPE), 
           position = "stack", stat = "identity") +
    coord_flip() +
    labs(x = "Event Type",y = "Damage Value",title = "Top 10 most harmful events (Economy)")
print(plot_eco)