1. Synopsis

The Storm Events Database is a set of weather related event data which is collected from across the US by the National Weather Service. This database has a rich set of data which captures severe weather events and any impacts it has on the health and welfare of the people and any economic impact on property or crops. The data has been collected since the 1950’s and while the earlier data is not as complete, on examination, it was decided to use it all for the analysis for our purposes. The purpose of this analysis is to understand the economic and public health impact of severe weather events on the US population. In understanding the impact to people and property of various weather related events, these risks can be mitigated by early warning programs and allocating resources to minimise the damage to people and property. The Storm Events dataset provided analysis for this paper.

2. Data Processing

The data was processed with the intention that the analysis could be reproduced by another person with access to the dataset, with the data processing and analysis that is documented in this paper and with access to two other files of reference data that were used to further tidy the data. The two files mentioned are Event-Categories.csv and stormdata-col-names.csv. The purpose of these files is: 1. Event-Categories.csv - As the Event Types in the Storm Events dataset are quite messy, overlapping and not completely coherent, it was decided to create a mapping file to map each Event Type to a higher level category for analysis. This file provides the link to how the Event Types are grouped. 2. stormdata-col-names.csv - a list of more user friendly variable names that are used to change the column names of the data set once loaded. These are listed in the same order a the current dataset.

Getting the environment ready

The following libraries were used in the analysis

      library(dplyr); library(ggplot2); library(readr); library(lubridate)  

The files used as part of the analysis are listed and assigned to variables for use later.

      raw.storm.data.file.source <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
      raw.storm.data.file <- "repdata-data-StormData.csv.bz2"
      event.category.file <- "Event-Categories.csv"
      col.names.file <- "stormdata-col-names.csv" 
            #'    A list of column types to ensure we know what type we will get
            #'    when reading the file
      col.types <- "icccicccccccccccccddddddncdcccciiiici"  # set specific column types

Loading the Data

The data was first downloaded from the source and then the dplyr function read_csv used to unpack and read the raw file. Once the data frame was created, the column names were changed to the more friendly ones. Also, the dates, which had been imported as chracter were changed to date format using the lubridate parse_date function. Now we have a dataset that is ready to be manipulated to support our analysis.

      download.file(raw.storm.data.file.source, "repdata-data-StormData.csv.bz2")
      storm.data <- read_csv(raw.storm.data.file, col_types = col.types)
            #'change the column names so that they are more readable
      col.names <- read_csv(col.names.file)
      colnames(storm.data) <-  col.names$Name
            #'parse the date field and ensure it is good date format
      storm.data$Begin.Date <- parse_date(storm.data$Begin.Date, "%m/%d/%Y %H:%M:%S")

Tidying the Data

Categorising Event Types The first task in tidying the data is grouping the messy Event Types into more coherent categories. This was done by manually mapping the existing event types to some broader categories suggested by the groupings and overlap of the exising data e.g. SNOW/ICE SNOW/ICE STORM SNOW/RAIN SNOW/RAIN/SLEET SNOW/SLEET SNOW/SLEET/FREEZING RAIN –> all became Snow/Ice etc

The mapping file was then used as a “master data” reference set and using a left join to the event data frame, it pulled in the Event Categories to enable easier analysis. This was really required as there are over 900 unique instances of value in the Event Type field which confuses the picture when you try doing analysis

Just prior to that, the data frame was subsetted to another data frame, just taking the columns we required.

            #' ----------- Now get the aggregation of event types into catgories
            #'                which will help with analysis
     event.category.data <- read_csv(event.category.file)
            
            #' ----------- get the subset of data we want to use for analysis
      event.data <- select(storm.data, Event.Type, Begin.Date, State.Code, No.of.Fatalities,
                           No.of.Injuries, Property.Damage, Property.Damage.UOM, 
                           Crop.Damage, Crop.Damage.UOM)
            #' ----------- Now we want to add an event category to make analysis easier
            #'                 and group the rather unruly event types
      event.data <- left_join(event.data, event.category.data, by = "Event.Type")

** Converting the Property and Crop Damage to the same unit of measure** In the source file there are two columns associated with the property and crop damage. They are PROPDMGEXP and CROPDMGEXP. They hold the values B, M, and K mostly, as like much of the data, there are some data quality issues. It is assumed these are the units of measure of the number if the damage column i.e. B=billions, M=millions and K=thousands and this lines up with the numberic values in the columns. Where there was no B,M or K, the value was 0. So it was taken that the Crop and Property damage observations needed to be brought to the same unit of measure and this we done using the following code.

It was a simple subsetting and multiplying out of the data to get variables with a common unit of measure i.e. dollars.

Firstly, convert the property damage:

            #' ---------------------------------------------------------------------------
            #' now calculate the value of property damage and crop damage as we need these
            #' The variables Property.Damage and Crop.Damage hold these, however there is a
            #' Unit of Measure (B=Billions, M=millions, K=thousands) in the Property.Damage.UOM
            #' and the Crop.Damage.UOM that we will need to use to calculate it correctly
            #' So, need to multiply by UOM to get correct value.

      event.data <- mutate(event.data, Prop.Dmg.Dollars = 0)      #' Property
      event.data <- mutate(event.data, Crop.Dmg.Dollars = 0)      #' Crops
      event.data <- mutate(event.data, Total.Damage.Dollars = 0)  #' Property + Crop Damage
      event.data <- mutate(event.data, Total.PeopleImpacted = 0)  #' Fatalities + Injuries
            #' First the Property Damage --------------------------------------------------
      B.rows <- event.data$Property.Damage.UOM == "B"       #' Billions
      M.rows <- event.data$Property.Damage.UOM == "M"       #' Millions
      K.rows <- event.data$Property.Damage.UOM == "K"       #' Thousands
            #' calculate the totals and update the dataset with new values
      if(sum(B.rows == TRUE)!=0)
            event.data$Prop.Dmg.Dollars[B.rows]  <- 
                              event.data$Property.Damage[B.rows] * 1000000000
      if(sum(M.rows == TRUE)!=0)
            event.data$Prop.Dmg.Dollars[M.rows]  <- 
                              event.data$Property.Damage[M.rows] * 1000000
      if(sum(K.rows == TRUE)!=0)
            event.data$Prop.Dmg.Dollars[K.rows]  <- 
                              event.data$Property.Damage[K.rows] * 1000
      other.rows <- event.data$Prop.Dmg.Dollars == 0
      if(sum(other.rows)!=0)
            event.data$Prop.Dmg.Dollars[other.rows] <- event.data$PropertyDamage[other.rows] 

Then the Crop Damage:

   B.rows <- event.data$Crop.Damage.UOM == "B"       #' Billions
      M.rows <- event.data$Crop.Damage.UOM == "M"       #' Millions
      K.rows <- event.data$Crop.Damage.UOM == "K"       #' Thousands
      
      #' Update the values       
      if(sum(B.rows) != 0)
            event.data$Crop.Dmg.Dollars[B.rows]  <- 
                        event.data$Crop.Damage[B.rows] * 1000000000
      if(sum(M.rows) != 0)
            event.data$Crop.Dmg.Dollars[M.rows]  <- 
                        event.data$Crop.Damage[M.rows] * 1000000
      if(sum(K.rows) != 0)
            event.data$Crop.Dmg.Dollars[K.rows]  <- 
                        event.data$Crop.Damage[K.rows] * 1000
      other.rows <- event.data$Crop.Dmg.Dollars == 0
      if(sum(other.rows)!=0)
            event.data$Crop.Dmg.Dollars[other.rows] <- event.data$Crop.Damage[other.rows] 

Then finally, create columns to hold the total damage value of both property and crop damage

            #' --------create totals for people and economic impact -------------------- 
      event.data$Total.Damage.Dollars <- event.data$Crop.Dmg.Dollars + event.data$Prop.Dmg.Dollars
      event.data$Total.People.Impacted <- event.data$No.of.Fatalities + event.data$No.of.Injuries

3. Results

It was clear from the analysis following, that over the years, that the biggest impact on people’s health and wellbeing was Tornado’s followed closely by Heat and Thunderstorms. By plotting year against the all event categories and the people impacted, the picture of the dramatic difference in the impact of various types of event.

      total.category.impact <- event.data %>% 
            filter(!is.na(Event.Category)) %>% 
            group_by(Event.Category, year) %>% 
            summarise(sum(No.of.Fatalities), sum(No.of.Injuries), sum(Total.People.Impacted)) 
      colnames(total.category.impact) <- c("Event.Category", "year", "Fatalities","Injuries", 
                                          "Total.People.Impacted")
      arrange(total.category.impact, desc(Total.People.Impacted))

… and plot the data:

      #' Plot 1 - No of people affected by Event Category
      qplot(y=Event.Category, x= Total.People.Impacted, data=category.impact, color=year, 
            main="No of People Impacted(Fatalities & Injuries) by Event Category and Year",
            ylab="Event Category", xlab="Total People Impacted" )

From the figure below, which maps all Event Catories for all years, it is clear that Tornados have a huge impact on people - way above the other types of events. However, it is worth noting that from the year-wise disposition of the data, it can be seen that in the early days of data collection, there was more focus on Tornados that there is today. Most of the older, darker dots are in the Tornado category. This may indicate that there is a higher proportional impact today of other types of events that needs to be considered.

As Tornados do have such an impact, the impact on different states was another area of investigation. The figure below shows that Tornados have a significant impact accoss many states with the highest aggregation where you would expect to see in the south looking at the top states impacted.

      tornado.state.economic.impact <- event.data %>% 
            filter(Event.Category == "Tornado", Total.Damage.Dollars > 0) %>% 
            group_by(State.Code, year) %>% 
            summarise(sum(Prop.Dmg.Dollars), sum(Crop.Dmg.Dollars), sum(Total.Damage.Dollars)) 
      colnames(tornado.state.economic.impact) <- c("Event.Category","State","year",
                                                   "Total.Property.Damage",
                                                   "Total.Crop.Damage", "Total.Damage.Dollars")
      #' Plot 2 - No of people affected by Event Category and state
      qplot(y=State, x= Total.People.Impacted, data=tornado.state.impact, 
            color=Total.People.Impacted,
            main="No of People Impacted(Fatalities & Injuries) by Tornados by State",
            ylab="State", xlab="Total People Impacted" )

Given there is a significant difference in the States impacted by Tornados, the following plot shows the total economic impact on the States. This includes, for the purpose of this report, both Property and Crop damage and is shown in $ millions. The data shown is for all time periods available i.e. 1950-current.

            #' Now looking at the economic damage
      economic.impact <- event.data %>% 
            filter(!is.na(Event.Category)) %>% 
            group_by(Event.Category, State.Code, year) %>% 
            summarise(sum(Prop.Dmg.Dollars), sum(Crop.Dmg.Dollars), sum(Total.Damage.Dollars)) 
      colnames(economic.impact) <- c("Event.Category","State.Code","year","Total.Property.Damage",
                                     "Total.Crop.Damage", "Total.Damage.Dollars")
            #'and plot the data
      qplot(data=tornado.state.economic.impact, x=State, y= Total.Damage.Dollars,  
            geom="bar", stat="identity",
            main="Total Economic Damage (Property & Crops) by State",
            xlab="State", ylab="Total Damage - $ Millions" ) + 
            coord_flip() 

As can be seen from the figure above, the economic effect shows a higher impact on the states more clearly than shown in the analysis by thosed killed and injured and can be used to better understand how this might be mitigated and planned for.