Synopsis

This report is devoted to the analysis of the US severe weather events and their consequences. Using the data from NOAA Storm Database on the weather conditions and associated damage in US from 1950 to 2011, we are trying to answer the following questions:

  1. Across the United States, which types of events are most harmful with respect to population health?
  2. Across the United States, which types of events have the greatest economic consequences?

Our analysis will deal with the aggregated data for weather types with summed damage values, without any regards to time and location.

Loading and Processing the Raw Data

First we load and process the raw data. We start with downloading the archive from the web and load it into R.

#download initial data (if it is not presented)
if(!file.exists("repdata-data-StormData.csv.bz2")){
        dataUrl<-
                "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
        download.file(dataUrl_zip,destfile=".\\repdata-data-StormData.csv.bz2")  
        }
         
#read data directly from bz2 archive
storm_data<-read.csv("repdata-data-StormData.csv.bz2")

Then we do the initial exploration of the data.

#number of observations and variables
dim(storm_data)
## [1] 902297     37
#structure of data
str(storm_data)
## 'data.frame':    902297 obs. of  37 variables:
##  $ STATE__   : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ BGN_DATE  : Factor w/ 16335 levels "1/1/1966 0:00:00",..: 6523 6523 4242 11116 2224 2224 2260 383 3980 3980 ...
##  $ BGN_TIME  : Factor w/ 3608 levels "00:00:00 AM",..: 272 287 2705 1683 2584 3186 242 1683 3186 3186 ...
##  $ TIME_ZONE : Factor w/ 22 levels "ADT","AKS","AST",..: 7 7 7 7 7 7 7 7 7 7 ...
##  $ COUNTY    : num  97 3 57 89 43 77 9 123 125 57 ...
##  $ COUNTYNAME: Factor w/ 29601 levels "","5NM E OF MACKINAC BRIDGE TO PRESQUE ISLE LT MI",..: 13513 1873 4598 10592 4372 10094 1973 23873 24418 4598 ...
##  $ STATE     : Factor w/ 72 levels "AK","AL","AM",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ EVTYPE    : Factor w/ 985 levels "   HIGH SURF ADVISORY",..: 834 834 834 834 834 834 834 834 834 834 ...
##  $ BGN_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ BGN_AZI   : Factor w/ 35 levels "","  N"," NW",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ BGN_LOCATI: Factor w/ 54429 levels "","- 1 N Albion",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ END_DATE  : Factor w/ 6663 levels "","1/1/1993 0:00:00",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ END_TIME  : Factor w/ 3647 levels ""," 0900CST",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ COUNTY_END: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ COUNTYENDN: logi  NA NA NA NA NA NA ...
##  $ END_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ END_AZI   : Factor w/ 24 levels "","E","ENE","ESE",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ END_LOCATI: Factor w/ 34506 levels "","- .5 NNW",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ LENGTH    : num  14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
##  $ WIDTH     : num  100 150 123 100 150 177 33 33 100 100 ...
##  $ F         : int  3 2 2 2 2 2 2 1 3 3 ...
##  $ MAG       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ FATALITIES: num  0 0 0 0 0 0 0 0 1 0 ...
##  $ INJURIES  : num  15 0 2 2 2 6 1 0 14 0 ...
##  $ PROPDMG   : num  25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
##  $ PROPDMGEXP: Factor w/ 19 levels "","-","?","+",..: 17 17 17 17 17 17 17 17 17 17 ...
##  $ CROPDMG   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CROPDMGEXP: Factor w/ 9 levels "","?","0","2",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ WFO       : Factor w/ 542 levels ""," CI","$AC",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ STATEOFFIC: Factor w/ 250 levels "","ALABAMA, Central",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ ZONENAMES : Factor w/ 25112 levels "","                                                                                                                               "| __truncated__,..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ LATITUDE  : num  3040 3042 3340 3458 3412 ...
##  $ LONGITUDE : num  8812 8755 8742 8626 8642 ...
##  $ LATITUDE_E: num  3051 0 0 0 0 ...
##  $ LONGITUDE_: num  8806 0 0 0 0 ...
##  $ REMARKS   : Factor w/ 436781 levels "","-2 at Deer Park\n",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ REFNUM    : num  1 2 3 4 5 6 7 8 9 10 ...

As we can see there are 37 variables of different types containing some information on the US severe weather events. However, per the nature of our research we do not need most variables in the data, so we are transforming it leaving only the following ones:

Basically, to answer the questions set above, we do not need any time or location data. For the Q1 (health damage) we need FATALITIES and INJURIES variables, for Q2 (economical damage) we need PROP… and CROP… variables. Further in the analysis, we will be aggregating those variables based on the types of severe events (EVTYPE) to see which ones are the most harmful.

#introduce transformed data, with variables needed
storm<-storm_data[,c("EVTYPE","FATALITIES","INJURIES","PROPDMG","PROPDMGEXP",
                     "CROPDMG","CROPDMGEXP")]
#remove the intitial big dataframe
rm(storm_data)

However, before we start the analysis, we need to perform a bit more transformations.

First, if we look at the event type column, we can see that there are quite a lot of unique values.

length(unique(storm$EVTYPE))
## [1] 985

985 different values is much more than 50 indicated in documentation. After the examination of the set of values we perform the clean-up reducing the number of unique severe weather types.

#set everything to lowe case
storm$EVTYPE<-tolower(storm$EVTYPE)
#delete summary rows (like "summary of july 2")
storm<-storm[!grepl('summary',storm$EVTYPE),]
#set common types to the most common disasters
storm$EVTYPE[grepl('hail',storm$EVTYPE)]<-"hail"
storm$EVTYPE[grepl('flood|tide|stream|urban',storm$EVTYPE)]<-
        "flood/stream"
storm$EVTYPE[grepl('wind|wnd',storm$EVTYPE)]<-"wind"
storm$EVTYPE[grepl('tornado|torndao',storm$EVTYPE)]<-"tornado"
storm$EVTYPE[grepl('heat|dry|hot|warm|driest',storm$EVTYPE)]<-"excessive heat"
storm$EVTYPE[grepl('snow',storm$EVTYPE)]<-"snow"
storm$EVTYPE[grepl('ice',storm$EVTYPE)]<-"ice"
storm$EVTYPE[grepl('rain|shower|wet|precip',storm$EVTYPE)]<-"rain/wet weather"
storm$EVTYPE[grepl('fire',storm$EVTYPE)]<-"wildfire"
storm$EVTYPE[grepl('hurricane|typhoon',storm$EVTYPE)]<-"hurricane"
storm$EVTYPE[grepl('thunderstorm',storm$EVTYPE)]<-"thunderstorm"
storm$EVTYPE[grepl('lightn',storm$EVTYPE)]<-"lightning"
storm$EVTYPE[grepl('freez|frost',storm$EVTYPE)]<-"freeze/frost"
storm$EVTYPE[grepl('cold|winter|glaze|cool',storm$EVTYPE)]<-"cold/winter"
storm$EVTYPE[grepl('record',storm$EVTYPE)]<-"record high/low temperature"
storm$EVTYPE[grepl('\\<storm\\>',storm$EVTYPE)]<-"storm"
storm$EVTYPE[grepl('avalanc',storm$EVTYPE)]<-"avalanche"
storm$EVTYPE[grepl('beach eros',storm$EVTYPE)]<-"beach erosion"
storm$EVTYPE[grepl('watersp|waytersp|water sp',storm$EVTYPE)]<-"waterspout"
storm$EVTYPE[grepl('volcan',storm$EVTYPE)]<-"volcano activity"
storm$EVTYPE[grepl('rip current',storm$EVTYPE)]<-"rip current"
storm$EVTYPE[grepl('funnel',storm$EVTYPE)]<-"funnel clouds"
storm$EVTYPE[grepl('fog',storm$EVTYPE)]<-"fog"
#check how many unique values left
length(unique(storm$EVTYPE))
## [1] 110

We have reduced number of different weather types from 985 to 110, which is a considerable gain.

Now, for the proper damage values representation we need to combine the damage values with their magnitude counterparts. This what magnitude values look like:

#set all the values to lower case
storm$PROPDMGEXP<-tolower(storm$PROPDMGEXP)
storm$CROPDMGEXP<-tolower(storm$CROPDMGEXP)
#check the value set
unique(storm$PROPDMGEXP)
##  [1] "k" "m" ""  "b" "+" "0" "5" "6" "?" "4" "2" "3" "h" "7" "-" "1" "8"
unique(storm$CROPDMGEXP)
## [1] ""  "m" "k" "b" "?" "0" "2"

We can see that these values are letters as well as numbers (and also some signs). According to data description they represent the exponent value with base 10 that needs to be applied to the damage value. We also know that “h” means 100 (10^2), “k” means 1000 (10^3), “m” means million (10^6), and “h” means billion (10^9). The numbers simply mean the exponent number (number of zeros). For other characters we assume nothing should be added to values (0 value).
With this information we perform the transformations.

#property damage values
storm$PROPDMGEXP[storm$PROPDMGEXP %in% c("0","+","?","-","")] <- 0
storm$PROPDMGEXP[storm$PROPDMGEXP == "h"] <- 2
storm$PROPDMGEXP[storm$PROPDMGEXP == "k"] <- 3
storm$PROPDMGEXP[storm$PROPDMGEXP == "m"] <- 6
storm$PROPDMGEXP[storm$PROPDMGEXP == "b"] <- 9
storm$PROPDMGEXP<-as.numeric(storm$PROPDMGEXP)
unique(storm$PROPDMGEXP)
##  [1] 3 6 0 9 5 4 2 7 1 8
storm$PROPDMG<-storm$PROPDMG*10^storm$PROPDMGEXP
storm$PROPDMGEXP<-NULL

#crop damage values
storm$CROPDMGEXP[storm$CROPDMGEXP %in% c("?","")] <- 0
storm$CROPDMGEXP[storm$CROPDMGEXP == "k"] <- 3
storm$CROPDMGEXP[storm$CROPDMGEXP == "m"] <- 6
storm$CROPDMGEXP[storm$CROPDMGEXP == "b"] <- 9
storm$CROPDMGEXP<-as.numeric(storm$CROPDMGEXP)
unique(storm$CROPDMGEXP)
## [1] 0 6 3 9 2
storm$CROPDMG<-storm$CROPDMG*10^storm$CROPDMGEXP
storm$CROPDMGEXP<-NULL

After the transformation we have the final data set with 5 variables:

The weather event types are now in better order, and the values are represented correctly.

Analysis

In this section we perform the analysis of data in order to answer the questions stated at the beginning.
To perform the analysis we have to aggregate the data based on the event type and having the sum aggregator function for numerical values (damage measures).

#aggregate the data
storm_aggr<-aggregate(.~EVTYPE,data=storm,FUN=sum)

Q1: Most harmful weather events: health damage

To answer this question we first should determine what most harmful means. As we are having two measures - number of injuries and number of deaths, we need to set the approach to ordering. Here we propose to treat one death as 5 injuries and set a score based on this assumption to each of the event types. After that we subtract the top 10 events by the obtained score.

#get subset for health-related data
storm_aggr_health<-storm_aggr[,c(1,2,3)]
storm_aggr_health$score<-storm_aggr_health$FATALITIES*5+storm_aggr_health$INJURIES
storm_aggr_health_top10<-storm_aggr_health[order(storm_aggr_health$score,
                                                 decreasing=TRUE)[1:10],]

Once we have the data for top 10 weather events, we show them on the plot.

library(reshape2)
library(ggplot2)
#transform data to long format
storm_aggr_health_top10_long<-melt(storm_aggr_health_top10[,1:3])
## Using EVTYPE as id variables
#create ggplot
storm_aggr_health_top10_long$EVTYPE <- 
        factor(storm_aggr_health_top10_long$EVTYPE,
                                              levels=
                       unique(storm_aggr_health_top10_long$EVTYPE))

ggplot(storm_aggr_health_top10_long,aes(x=EVTYPE,y=value,fill=factor(variable)))+
  geom_bar(stat="identity",position="dodge")+
         theme(text = element_text(size=20),
               axis.text.x=element_text(angle=45,hjust=1))+
        labs(x="Event type",y="Affected people")+
        ggtitle("Tot 10 most harmful weather conditions: health")

As we can see, tornado is the most harmful weather condition, with huge amount of injuries and deaths.

Q2: Most harmful weather events: economic damage

Here we do not introduce any conversion factor between crop and property damage types - we threat them as equal. So, the ordering score is just the sum of two values.

#get subset for economy-related data
storm_aggr_econ<-storm_aggr[,c(1,4,5)]
storm_aggr_econ$score<-storm_aggr_econ$PROPDMG+storm_aggr_econ$CROPDMG
storm_aggr_econ_top10<-storm_aggr_econ[order(storm_aggr_econ$score,
                                                 decreasing=TRUE)[1:10],]

Once we have the data for top 10 weather events, we show them on the plot.

#transform data to long format
storm_aggr_econ_top10_long<-melt(storm_aggr_econ_top10[,1:3])
## Using EVTYPE as id variables
#create ggplot
storm_aggr_econ_top10_long$EVTYPE <- 
        factor(storm_aggr_econ_top10_long$EVTYPE,
                                              levels=
                       unique(storm_aggr_econ_top10_long$EVTYPE))

ggplot(storm_aggr_econ_top10_long,aes(x=EVTYPE,y=value/1000000,fill=factor(variable)))+
  geom_bar(stat="identity",position="dodge")+
         theme(text = element_text(size=20),
               axis.text.x=element_text(angle=45,hjust=1))+
        labs(x="Event type",y="Damage ($, Millions)")+
        ggtitle("Tot 10 most harmful weather conditions: economy")

As we can see, flood is the most harmful weather condition, with huge economic impact both in property and crop.

Results

As we have conducted the research, we can make the following conclusions: