Executive summary (synopsis)

We use the StormData dataset collected by the National Weather Service (https://d396qusza40orc.cloudfront.net/repdata%2Fpeer2_doc%2Fpd01016005curr.pdf) to find the 25 types of severe weather events that cause the most damage to human health and property. The analysis in this document can be used as a policy tool to suggest the scale of preparation needed for different types of severe weather events.

The impact on human health is recorded as fatalities and injuries, and we present the total across the two categories. The impact on property is recorded as property damage and crop damage, both measured in USD, and we present the total across the two categories.

We measure the damage to health and property both as a total per event type and as an average per occurrence of the event type. The totals identify the most damaging event types overall, reflecting both how often an event occurs and how severe it is, while the means describe the severity of a single occurrence.

We find that in terms of totals the most damaging event for population health is the tornado, by a very large margin. In terms of means, based on the information available in the dataset, the most damaging weather event for population health is a heat wave. We also find that in terms of totals the most damaging event for property and crops is the tornado. In terms of means, based on the information available in the dataset, the most damaging weather event for property and crops is the tropical storm “Gordon”.

Data processing

Set up of the environment

 setwd("/Users/nikolaydobrinov/Documents/work/Courses/R/WorkDirectory/Course5_week2_coding_assignment")
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2)
library(gridExtra)
## 
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
## 
##     combine

Loading data and taking a quick look

For the purposes of reproducibility the following code downloads the data, and then loads the csv file into R. We also examine which variables are useful to answer the questions required by the assignment.

fileUrl <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
download.file(fileUrl,destfile="./Data/StormData.csv.bz2",method="curl")
# list.files("./Data")

storm <- read.csv("./Data/StormData.csv.bz2")
str(storm)
## 'data.frame':    902297 obs. of  37 variables:
##  $ STATE__   : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ BGN_DATE  : Factor w/ 16335 levels "1/1/1966 0:00:00",..: 6523 6523 4242 11116 2224 2224 2260 383 3980 3980 ...
##  $ BGN_TIME  : Factor w/ 3608 levels "00:00:00 AM",..: 272 287 2705 1683 2584 3186 242 1683 3186 3186 ...
##  $ TIME_ZONE : Factor w/ 22 levels "ADT","AKS","AST",..: 7 7 7 7 7 7 7 7 7 7 ...
##  $ COUNTY    : num  97 3 57 89 43 77 9 123 125 57 ...
##  $ COUNTYNAME: Factor w/ 29601 levels "","5NM E OF MACKINAC BRIDGE TO PRESQUE ISLE LT MI",..: 13513 1873 4598 10592 4372 10094 1973 23873 24418 4598 ...
##  $ STATE     : Factor w/ 72 levels "AK","AL","AM",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ EVTYPE    : Factor w/ 985 levels "   HIGH SURF ADVISORY",..: 834 834 834 834 834 834 834 834 834 834 ...
##  $ BGN_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ BGN_AZI   : Factor w/ 35 levels "","  N"," NW",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ BGN_LOCATI: Factor w/ 54429 levels ""," Christiansburg",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ END_DATE  : Factor w/ 6663 levels "","1/1/1993 0:00:00",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ END_TIME  : Factor w/ 3647 levels ""," 0900CST",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ COUNTY_END: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ COUNTYENDN: logi  NA NA NA NA NA NA ...
##  $ END_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ END_AZI   : Factor w/ 24 levels "","E","ENE","ESE",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ END_LOCATI: Factor w/ 34506 levels ""," CANTON"," TULIA",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ LENGTH    : num  14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
##  $ WIDTH     : num  100 150 123 100 150 177 33 33 100 100 ...
##  $ F         : int  3 2 2 2 2 2 2 1 3 3 ...
##  $ MAG       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ FATALITIES: num  0 0 0 0 0 0 0 0 1 0 ...
##  $ INJURIES  : num  15 0 2 2 2 6 1 0 14 0 ...
##  $ PROPDMG   : num  25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
##  $ PROPDMGEXP: Factor w/ 19 levels "","-","?","+",..: 17 17 17 17 17 17 17 17 17 17 ...
##  $ CROPDMG   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CROPDMGEXP: Factor w/ 9 levels "","?","0","2",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ WFO       : Factor w/ 542 levels ""," CI","%SD",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ STATEOFFIC: Factor w/ 250 levels "","ALABAMA, Central",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ ZONENAMES : Factor w/ 25112 levels "","                                                                                                                               "| __truncated__,..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ LATITUDE  : num  3040 3042 3340 3458 3412 ...
##  $ LONGITUDE : num  8812 8755 8742 8626 8642 ...
##  $ LATITUDE_E: num  3051 0 0 0 0 ...
##  $ LONGITUDE_: num  8806 0 0 0 0 ...
##  $ REMARKS   : Factor w/ 436781 levels "","\t","\t\t",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ REFNUM    : num  1 2 3 4 5 6 7 8 9 10 ...
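
The download step above fetches the file on every run. A minimal sketch of a cached download is shown below, assuming it is acceptable to reuse a file that already exists on disk. The helper name fetch_if_missing and its downloader argument are hypothetical and are not part of the original analysis.

```r
# Hypothetical helper: download `url` to `dest` only when `dest` is missing.
# `downloader` defaults to base R's download.file; it is a parameter so the
# function can be exercised without network access.
fetch_if_missing <- function(url, dest, downloader = download.file, ...) {
  # Create the target directory if it does not exist yet.
  dir.create(dirname(dest), showWarnings = FALSE, recursive = TRUE)
  if (!file.exists(dest)) {
    downloader(url, destfile = dest, ...)
    return(TRUE)   # a download was attempted
  }
  FALSE            # the file on disk was reused
}
```

With this helper the call would read fetch_if_missing(fileUrl, "./Data/StormData.csv.bz2", method = "curl"), which also creates the ./Data directory when it is absent.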

The variables that appear to be useful to answer the questions required by the assignment are EVTYPE (group by it), FATALITIES, INJURIES, PROPDMG and CROPDMG.

The number of unique event types is 985.

length(unique(storm$EVTYPE))
## [1] 985

The analysis in the next section assumes that FATALITIES and INJURIES measure the total effect of severe weather events on population health, and that PROPDMG and CROPDMG measure their economic impact. The data set does not provide information on the economic value of population health, so economic damage in this analysis is measured only by the damage to physical property and farm crops. In addition, we measure the economic impact in dollars and not in units of buildings or acres of crops damaged.

To measure the impact of severe weather events on health and property we will measure both totals and means per event type. The totals will provide information on the overall most damaging type of event, due to both the number of times it occurs and its severity. The means will provide information on the severity of the type of event.
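
To make the difference between the two measures concrete, here is a small sketch on made-up numbers (not taken from StormData). Event type A is frequent and mild, event type B is rare and severe, so they swap places depending on the measure.

```r
library(dplyr)

# Made-up numbers for illustration only; not taken from StormData.
toy <- data.frame(
  evtype = c("A", "A", "A", "A", "B"),
  health = c(1, 1, 1, 1, 3)
)

toy_summary <- toy %>%
  group_by(evtype) %>%
  summarise(health_total = sum(health),
            health_mean  = mean(health),
            occurs       = n())

# A leads on the total (4 vs 3); B leads on the mean (3 vs 1).
toy_summary
```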

Select only the variables that will be needed, sum the health damages across fatalities and injuries, and sum the economic damages across property and crop damages.

data <- select(storm,EVTYPE,FATALITIES, INJURIES, PROPDMG, CROPDMG)
names(data) <- tolower(names(data))

data <- data %>% 
        mutate(health=fatalities+injuries, econ=propdmg+cropdmg) %>%
        select(evtype,health, econ) 
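
The sums above take PROPDMG and CROPDMG as recorded. The str() output also lists PROPDMGEXP and CROPDMGEXP, which this analysis does not use. Assuming these columns encode a magnitude for the damage figures, with K, M and B standing for thousands, millions and billions, one possible refinement is sketched below. The helper names exp_multiplier and to_usd are hypothetical, and leaving every other code at a multiplier of 1 is a simplification.

```r
# Sketch only: map an exponent code to a numeric multiplier.
# Assumption: K/M/B (any case) mean 1e3/1e6/1e9. Every other code
# (blank, digits, "?", "+", "-") is left at 1 because its meaning is unclear.
exp_multiplier <- function(code) {
  lookup <- c(K = 1e3, M = 1e6, B = 1e9)
  mult <- unname(lookup[toupper(as.character(code))])
  mult[is.na(mult)] <- 1
  mult
}

# Scale a recorded damage amount by its exponent code.
to_usd <- function(amount, code) amount * exp_multiplier(code)

to_usd(c(25, 2.5, 1, 3), c("K", "m", "B", ""))
```

With such a helper, the econ column could be built as to_usd(propdmg, propdmgexp) + to_usd(cropdmg, cropdmgexp), provided the two exponent columns are kept in the select() step.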

Group the data by evtype and calculate the totals and means of the health and econ measures across the groups. Also count the number of times each event type occurs in the data set.

group_data <- data %>%
        group_by(evtype) %>% 
        summarise(health_total = sum(health),health_mean = mean(health),
                  econ_total = sum(econ),econ_mean = mean(econ), occurs = n())

The following table shows the first 10 rows of the data set after processing.

head(group_data,10)
## # A tibble: 10 × 6
##                   evtype health_total health_mean econ_total econ_mean
##                   <fctr>        <dbl>       <dbl>      <dbl>     <dbl>
## 1     HIGH SURF ADVISORY            0           0        200       200
## 2          COASTAL FLOOD            0           0          0         0
## 3            FLASH FLOOD            0           0         50        50
## 4              LIGHTNING            0           0          0         0
## 5              TSTM WIND            0           0        108        27
## 6        TSTM WIND (G45)            0           0          8         8
## 7             WATERSPOUT            0           0          0         0
## 8                   WIND            0           0          0         0
## 9                      ?            0           0          5         5
## 10       ABNORMAL WARMTH            0           0          0         0
## # ... with 1 more variables: occurs <int>
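
The str() output shows that some of the 985 EVTYPE levels carry leading whitespace (for example "   HIGH SURF ADVISORY"), so labels that differ only in padding or letter case would be counted as separate event types. A minimal sketch of a clean-up step is given below. The helper normalize_evtype is hypothetical, the analysis above does not apply it, and it does not merge labels that differ in wording.

```r
# Hypothetical helper: standardise event labels before grouping.
normalize_evtype <- function(x) {
  x <- toupper(trimws(as.character(x)))  # drop padding, unify letter case
  gsub("\\s+", " ", x)                   # collapse repeated whitespace
}

normalize_evtype(c("   HIGH SURF ADVISORY", "Coastal Flood", "TSTM  WIND"))
```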

Analysis

The plots below provide a ranking of the top 25 types of severe weather events, by the number of times they occur and by the damages they inflict on population health and property.

First, rank the event types by frequency of occurrence. Sort the data by ‘occurs’, keep only the top 25 evtypes, and plot them by frequency of occurrence. The event type that occurs most often in this dataset is “Hail”.

occurs <- group_data %>% arrange(desc(occurs)) %>% filter(row_number() <= 25)
# order the evtype levels by occurs; otherwise the bars do not appear in the
# correct order on the ggplot
occurs <- transform(occurs, 
                          evtype = reorder(evtype, occurs))


ggplot(data=occurs, aes(x=evtype, y=occurs)) +
        geom_bar(stat="identity") + coord_flip() +
        labs(title="Number of occurrences in the dataset \n by the type of severe weather event \n (top 25 event types)", 
             x="Type of severe weather event",y="Number of occurrences") +
        theme(plot.title = element_text(hjust = 0.5))
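
The reorder() call in the code above matters because ggplot2 places the categories of a discrete axis in factor-level order, which is alphabetical by default. A small base R sketch on made-up values shows how reorder() changes the levels. The vectors f and counts are illustrative only.

```r
f <- factor(c("a", "b", "c"))
counts <- c(3, 1, 2)

levels(f)                    # alphabetical: "a" "b" "c"
levels(reorder(f, counts))   # ascending by count: "b" "c" "a"
```

Because the levels end up in ascending order, the category with the largest value is the last level, which coord_flip() draws at the top of the chart.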

Next, rank the event types by the damage they inflict on population health. Sort the data by health_total and by health_mean, and keep only the top 25 evtypes in each case. Plot the top 25 event types by total and by mean damage to population health. Note that the ranking by totals and the ranking by means do not coincide. This means that some event types accumulate a disproportionately high total not only because they are severe but also because they occur more often in the data set. The dataset is only representative of the events that occurred from 1950 to 2011, with many missing observations in the early years of data collection. We find that in terms of totals the most damaging event for population health is the tornado, by a very large margin. In terms of means, based on the information available in the dataset, the most damaging weather event for population health is a heat wave.

# Sort the data by health_total and by health_mean, and keep only the top 25 evtypes of each sorted dataset.

health_total <- group_data %>% arrange(desc(health_total)) %>% filter(row_number() <= 25)
# order the evtype levels by health_total; otherwise the bars do not appear in
# the correct order on the ggplot
health_total <- transform(health_total, 
                          evtype = reorder(evtype, health_total))

health_mean <- group_data %>% arrange(desc(health_mean)) %>% filter(row_number() <= 25)
health_mean <- transform(health_mean, 
                          evtype = reorder(evtype, health_mean))

# place two ggplots on a grid
healthplot1 <- ggplot(data=health_total, aes(x=evtype, y=health_total)) +
        geom_bar(stat="identity") + coord_flip() +
        labs(title="Number of individuals impacted \n by the type of severe weather event \n (top 25 event types)", 
             x="Type of severe weather event",y="Number of individuals impacted") +
        theme(plot.title = element_text(hjust = 0.5,size = 9))

healthplot2 <- ggplot(data=health_mean, aes(x=evtype, y=health_mean)) +
        geom_bar(stat="identity") + coord_flip() +
        labs(title="Average number of individuals impacted \n by one occurrence of each type \n of severe weather event \n (top 25 event types)", 
             x="Type of severe weather event",y="Average number of individuals impacted") +
        theme(plot.title = element_text(hjust = 0.5,size = 9))
grid.arrange(healthplot1, healthplot2, ncol=2)
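
The sort, subset and reorder steps above are repeated for every measure. As a sketch of one way to factor them out, the hypothetical helper top_by below does the three steps in base R for any numeric column. It is an alternative formulation, not the code used to produce the plots.

```r
# Hypothetical helper: keep the n rows with the largest values of `col`
# and order the evtype factor levels by that column for plotting.
top_by <- function(df, col, n = 25) {
  out <- df[order(df[[col]], decreasing = TRUE), , drop = FALSE]
  out <- head(out, n)
  # factor() drops the levels of event types that were filtered out.
  out$evtype <- reorder(factor(out$evtype), out[[col]])
  out
}

# Made-up values for illustration only.
toy <- data.frame(evtype = c("a", "b", "c", "d"), x = c(5, 9, 1, 7))
top_by(toy, "x", n = 2)
```

Under this sketch, health_total could be obtained as top_by(group_data, "health_total") and health_mean as top_by(group_data, "health_mean").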

Next, rank the event types by the economic damage they inflict on property and crops. Sort the data by econ_total and by econ_mean, and keep only the top 25 evtypes in each case. Plot the top 25 event types by total and by mean economic damage to property and crops. Note that the ranking by totals and the ranking by means do not coincide. This means that some event types accumulate a disproportionately high total not only because they are severe but also because they occur more often in the data set. The dataset is only representative of the events that occurred from 1950 to 2011, with many missing observations in the early years of data collection. We also find that in terms of totals the most damaging event for property and crops is the tornado. In terms of means, based on the information available in the dataset, the most damaging weather event for property and crops is the tropical storm “Gordon”.

# Sort the data by econ_total and by econ_mean, and keep only the top 25 evtypes of each sorted dataset.
econ_total <- group_data %>% arrange(desc(econ_total)) %>% filter(row_number() <= 25)
econ_total <- transform(econ_total, 
                          evtype = reorder(evtype, econ_total))

econ_mean <- group_data %>% arrange(desc(econ_mean)) %>% filter(row_number() <= 25)
econ_mean <- transform(econ_mean, 
                        evtype = reorder(evtype, econ_mean))

# place two ggplots on a grid
econplot1 <- ggplot(data=econ_total, aes(x=evtype, y=econ_total)) +
        geom_bar(stat="identity") + coord_flip() +
        labs(title="Damages on property and crops \n by the type of severe weather event \n (top 25 event types)", 
             x="Type of severe weather event",y="Damages on property and crops (in USD)") +
        theme(plot.title = element_text(hjust = 0.5,size = 9))

econplot2 <- ggplot(data=econ_mean, aes(x=evtype, y=econ_mean)) +
        geom_bar(stat="identity") + coord_flip() +
        labs(title="Average damages on property/crops \n by one occurrence of each type \n of severe weather event \n (top 25 event types)", 
             x="Type of severe weather event",y="Average damages on property and crops (in USD)") +
        theme(plot.title = element_text(hjust = 0.5,size = 9))
grid.arrange(econplot1, econplot2, ncol=2)
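
An event type that appears only a handful of times has a mean determined by very few records, so the mean rankings above can be sensitive to one-off labels. One possible robustness check, which is not part of this analysis, is to require a minimum number of occurrences before ranking by the mean. The sketch below uses a made-up table shaped like group_data, and the threshold min_occurs is an arbitrary value chosen for illustration.

```r
library(dplyr)

# Made-up summary table in the shape of group_data; values are illustrative.
toy_groups <- data.frame(
  evtype    = c("RARE EVENT", "COMMON EVENT", "OTHER EVENT"),
  econ_mean = c(500, 40, 25),
  occurs    = c(1, 1200, 300)
)

min_occurs <- 5  # arbitrary threshold chosen for this sketch

# Rank by mean only among event types seen at least min_occurs times.
robust_mean_rank <- toy_groups %>%
  filter(occurs >= min_occurs) %>%
  arrange(desc(econ_mean))

robust_mean_rank
```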

Conclusion

We find that in terms of totals the most damaging event for population health is the tornado, by a very large margin. In terms of means, based on the information available in the dataset, the most damaging weather event for population health is a heat wave. We also find that in terms of totals the most damaging event for property and crops is the tornado. In terms of means, based on the information available in the dataset, the most damaging weather event for property and crops is the tropical storm “Gordon”. For the purposes of policy analysis, note that some events have a lower chance of occurrence but may have a disproportionately high severity. Both the chance of occurrence and the severity of the event must be taken into account in policy analysis.