Synopsis

This is the report for the second assignment for Reproducible Research course, october 2014. The aim is to study the NOAA dataset in order to inspect what type of events cause more fatalities and economic injuries.

Our approach is to produce a tidy data set where event categories have been unified (though some will remain uncleaned), which also includes information about frequency, and fatalities and injuries per event. We will try answer the questions by making plots using this tidy data set.

Data processing

In this chapter we will read and proccess data.

Reading data

NOAA comes icsv format; the following code reads and loads the file into the workspace.

Set working directory and load libraries.

#set working directory
setwd("~/Documents/coursera/DataAnalysis/5ReproducibleResearch/RepData_PeerAssessment2")
#libraries
library(magrittr)
library(dplyr)
## 
## Attaching package: 'dplyr'
## 
## The following object is masked from 'package:stats':
## 
##     filter
## 
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

Read data.

#read data. Note there's no need to decompress the file.
storm <- read.csv(file = "repdata-data-StormData.csv.bz2", header = TRUE)
attach(storm)
## The following object is masked from package:base:
## 
##     F

Important variables are:

  • EVTYPE: event type, with 985! categories
  • FATALITIES: measure of impact on health
  • INJURIES: measure of economic impact

Approach

For measuring impact on health and economic impact the frequency of the event (i.e. probability of occurrence) as well as impact (from both aspects i.e. costs and health impact) have to be taken into account.

We will construct a tidy data set that will contain one row per event, and:

  • a colum with event names, where equivalent events have been aggregated
  • a column of frequencies (probability of occurrence)
  • a column of aggregated fatalities for each event
  • a column of aggregated injuries for each event

Basically we will transform the original dataset to obtain one row per event, and then add the other columns.

The “important events” will be selected based on 3 criteria:

  1. most frequent events
  2. events with more fatalities
  3. events with more injuries
length(levels(storm$EVTYPE))
## [1] 985

There are 985 distint events. In the barplot (not shown here) we see only some of them are really frequent. We will filter events occuring more than 1000 times in the dataset.

df <- as.data.frame(table(storm$EVTYPE)) #convert frequency table to data frame
barplot(table(storm$EVTYPE))

plot of chunk frequencies2

# select events occurring more than 1000 times (arbitrary)
dffreq1000 <- df[which(df$Freq> 1000),] 
frequent_events <- as.character(dffreq1000$EVTYPE)
frequent_events
## character(0)

We see some levels refer to the same type of and need to be aggregated.

First aggregation of similar categories

For the selected events we aggregate those that seem to refer to the same concept.

Makes sense to aggregate:

  • “BLIZZARD”, “ICE STORM” to unique: “WINTER STORM”
  • “FLASH FLOOD”, " URBAN/SML STREAM FLD" to unique: “FLOOD”
  • “HIGH WINDS” to unique: “HIGH WIND”
  • “EXTREME COLD/WIND CHILL” to unique: “FROST/FREEZE”
  • “MARINE THUNDERSTORM WIND” , “MARINE TSTM WIND”, “THUNDERSTORM WINDS”, “TSTM WIND” to unique: “THUNDERSTORM WIND”
  • “WILD/FOREST FIRE” to unique: “WILDFIRE”
  • “WINTER WEATHER/MIX” to unique: “WINTER WEATHER”
#create a new vector with aggregated events
#Aggregate similar categories
etype <- gsub(EVTYPE, pattern = ("BLIZZARD|ICE STORM"), replacement = "WINTER STORM")
etype <- gsub(etype, pattern = ("FLASH FLOOD|URBAN/SML STREAM FLD"), replacement = "FLOOD")
etype <- gsub(etype, pattern = ("HIGH WINDS"), replacement = "HIGH WIND")
etype <- gsub(etype, pattern = ("EXTREME COLD/WIND CHILL"), replacement = "FROST/FREEZE")
etype <- gsub(etype, pattern = ("MARINE THUNDERSTORM WIND|MARINE TSTM WIND|THUNDERSTORM WINDS|TSTM WIND"), replacement = "THUNDERSTORM WIND")
etype <- gsub(etype, pattern = ("WILD/FOREST FIRE"), replacement = "WILDFIRE")
etype <- gsub(etype, pattern = ("WINTER WEATHER/MIX"), replacement = "WINTER WEATHER")

Updates.

# update column
storm$etype <- etype # in original data
# update frequent_events
df <- as.data.frame(table(etype)) %>% arrange(desc(Freq))
df30 <- head(df, n=30)# select events occurring more than 1000 times
frequent_events <- as.character(df30$etype)
frequent_events
##  [1] "THUNDERSTORM WIND"      "HAIL"                  
##  [3] "FLOOD"                  "TORNADO"               
##  [5] "HIGH WIND"              "WINTER STORM"          
##  [7] "LIGHTNING"              "HEAVY SNOW"            
##  [9] "HEAVY RAIN"             "WINTER WEATHER"        
## [11] "FUNNEL CLOUD"           "WILDFIRE"              
## [13] "WATERSPOUT"             "STRONG WIND"           
## [15] "DROUGHT"                "FROST/FREEZE"          
## [17] "EXCESSIVE HEAT"         "DENSE FOG"             
## [19] "THUNDERSTORM WIND/HAIL" "FLOODING"              
## [21] "HEAT"                   "HIGH SURF"             
## [23] "TROPICAL STORM"         "EXTREME COLD"          
## [25] "COASTAL FLOOD"          "FLOOD/FLOOD"           
## [27] "LAKE-EFFECT SNOW"       "LANDSLIDE"             
## [29] "SNOW"                   "COLD/WIND CHILL"

Some more aggregation in the frequent_events list:

  • “FLOOD/FLOOD”, “COASTAL FLOOD”, “FLOODING” to “FLOOD”
  • “HAIL”, “THUNDERSTORM WIND” to “THUNDERSTORM WIND/HAIL”
  • “SNOW” to “HEAVY SNOW”
  • “EXTREME COLD” to “FROST/FREEZE”
  • “EXCESSIVE HEAT” to “HEAT”
  • “EXTREME COLD” to “FROST/FREEZE”
frequent_events <- gsub(frequent_events, pattern = ("FLOOD/FLOOD|COASTAL FLOOD|FLOODING"), replacement = "FLOOD")
etype <- gsub(etype, pattern = ("FLOOD/FLOOD|COASTAL FLOOD|FLOODING"), replacement = "FLOOD")
frequent_events <- gsub(frequent_events, pattern = ("HAIL|THUNDERSTORM WIND"), replacement = "THUNDERSTORM WIND/HAIL")
etype <- gsub(etype, pattern = ("HAIL|THUNDERSTORM WIND"), replacement = "THUNDERSTORM WIND/HAIL")
frequent_events <- gsub(frequent_events, pattern = ("EXCESSIVE HEAT"), replacement = "HEAT")
etype <- gsub(etype, pattern = ("EXCESSIVE HEAT"), replacement = "HEAT")
frequent_events <- gsub(frequent_events, pattern = ("EXTREME COLD"), replacement = "FROST/FREEZE")
etype <- gsub(etype, pattern = ("EXTREME COLD"), replacement = "FROST/FREEZE")
# list of unique events
unique(frequent_events) # 24 more-or-less distinct events
##  [1] "THUNDERSTORM WIND/HAIL"                       
##  [2] "FLOOD"                                        
##  [3] "TORNADO"                                      
##  [4] "HIGH WIND"                                    
##  [5] "WINTER STORM"                                 
##  [6] "LIGHTNING"                                    
##  [7] "HEAVY SNOW"                                   
##  [8] "HEAVY RAIN"                                   
##  [9] "WINTER WEATHER"                               
## [10] "FUNNEL CLOUD"                                 
## [11] "WILDFIRE"                                     
## [12] "WATERSPOUT"                                   
## [13] "STRONG WIND"                                  
## [14] "DROUGHT"                                      
## [15] "FROST/FREEZE"                                 
## [16] "HEAT"                                         
## [17] "DENSE FOG"                                    
## [18] "THUNDERSTORM WIND/HAIL/THUNDERSTORM WIND/HAIL"
## [19] "HIGH SURF"                                    
## [20] "TROPICAL STORM"                               
## [21] "LAKE-EFFECT SNOW"                             
## [22] "LANDSLIDE"                                    
## [23] "SNOW"                                         
## [24] "COLD/WIND CHILL"

Update values in data frame.

# update column
storm$etype <- etype # in original data
# update frequent_events
df <- as.data.frame(table(storm$etype)) %>% arrange(desc(Freq))
frequent_events <- unique(frequent_events) # selected 24

Events with more fatalities

For each event, compute the sum of fatalities and create a new data frame in descending order of number of fatalities.

Instead of the original variabe, EVTYPE, we now use variable etype where some “similar” levels have been aggregated.

df2 <- storm %>% group_by(etype) %>% summarise(sum_fatalities=sum(FATALITIES)) %>% arrange(desc(sum_fatalities))
#take the top-30 (same criterion as with frequent events)
df230 <- head(df2, n = 30)
fatal_events <- as.character(df230$etype)
#see them
fatal_events
##  [1] "TORNADO"                                                  
##  [2] "HEAT"                                                     
##  [3] "FLOOD"                                                    
##  [4] "LIGHTNING"                                                
##  [5] "THUNDERSTORM WIND/HAIL"                                   
##  [6] "WINTER STORM"                                             
##  [7] "RIP CURRENT"                                              
##  [8] "FROST/FREEZE"                                             
##  [9] "HIGH WIND"                                                
## [10] "AVALANCHE"                                                
## [11] "RIP CURRENTS"                                             
## [12] "HEAT WAVE"                                                
## [13] "HEAVY SNOW"                                               
## [14] "STRONG WIND"                                              
## [15] "HIGH SURF"                                                
## [16] "HEAVY RAIN"                                               
## [17] "EXTREME HEAT"                                             
## [18] "COLD/WIND CHILL"                                          
## [19] "WILDFIRE"                                                 
## [20] "HURRICANE/TYPHOON"                                        
## [21] "FOG"                                                      
## [22] "HURRICANE"                                                
## [23] "WINTER WEATHER"                                           
## [24] "TROPICAL STORM"                                           
## [25] "HEAVY SURF/HIGH SURF"                                     
## [26] "LANDSLIDE"                                                
## [27] "COLD"                                                     
## [28] "TSUNAMI"                                                  
## [29] "UNSEASONABLY WARM AND DRY"                                
## [30] "TORNADOES, THUNDERSTORM WIND/HAIL, THUNDERSTORM WIND/HAIL"

New aggregations needed.

Second aggregation

Makes sense to aggregate:

  • “EXCESSIVE HEAT”, “EXTREME HEAT”, “HEAT WAVE” to unique: “HEAT”
  • “RIP CURRENTS” to unique: “RIP CURRENT”
  • “EXTREME COLD”, “COLD” to unique: “FROST/FREEZE”
  • “HURRICANE/TYPHOON” to unique: “HURRICANE”
  • “HEAVY SURF/HIGH SURF” to unique: “HIGH SURF”
#create a new column with aggregated events
#Aggregate similar categories
etype <- gsub(etype, pattern = ("EXCESSIVE HEAT|EXTREME HEAT|HEAT WAVE"), replacement = "HEAT")
fatal_events <- gsub(fatal_events, pattern = ("EXCESSIVE HEAT|EXTREME HEAT|HEAT WAVE"), replacement = "HEAT")
etype <- gsub(etype, pattern = ("RIP CURRENTS"), replacement = "RIP CURRENT")
fatal_events <- gsub(fatal_events, pattern = ("RIP CURRENTS"), replacement = "RIP CURRENT")
etype <- gsub(etype, pattern = ("EXTREME COLD|COLD"), replacement = "FROST/FREEZE")
fatal_events <- gsub(fatal_events, pattern = ("RIP CURRENTS"), replacement = "RIP CURRENT")
etype <- gsub(etype, pattern = ("HURRICANE/TYPHOON"), replacement = "HURRICANE")
fatal_events <- gsub(fatal_events, pattern = ("HURRICANE/TYPHOON"), replacement = "HURRICANE")
etype <- gsub(etype, pattern = ("HEAVY SURF/HIGH SURF"), replacement = "HIGH SURF")
fatal_events <- gsub(fatal_events, pattern = ("HURRICANE/TYPHOON"), replacement = "HURRICANE")

Updates.

# update column
storm$etype <- etype # in original data
# update events list
fatal_events <- unique(fatal_events)
fatal_events # 26 events
##  [1] "TORNADO"                                                  
##  [2] "HEAT"                                                     
##  [3] "FLOOD"                                                    
##  [4] "LIGHTNING"                                                
##  [5] "THUNDERSTORM WIND/HAIL"                                   
##  [6] "WINTER STORM"                                             
##  [7] "RIP CURRENT"                                              
##  [8] "FROST/FREEZE"                                             
##  [9] "HIGH WIND"                                                
## [10] "AVALANCHE"                                                
## [11] "HEAVY SNOW"                                               
## [12] "STRONG WIND"                                              
## [13] "HIGH SURF"                                                
## [14] "HEAVY RAIN"                                               
## [15] "COLD/WIND CHILL"                                          
## [16] "WILDFIRE"                                                 
## [17] "HURRICANE"                                                
## [18] "FOG"                                                      
## [19] "WINTER WEATHER"                                           
## [20] "TROPICAL STORM"                                           
## [21] "HEAVY SURF/HIGH SURF"                                     
## [22] "LANDSLIDE"                                                
## [23] "COLD"                                                     
## [24] "TSUNAMI"                                                  
## [25] "UNSEASONABLY WARM AND DRY"                                
## [26] "TORNADOES, THUNDERSTORM WIND/HAIL, THUNDERSTORM WIND/HAIL"

Events with more injuries

We proceed the same way as with fatalities.

df3 <- storm %>% group_by(etype) %>% summarise(sum_injuries=sum(INJURIES)) %>% arrange(desc(sum_injuries))
#take the top-30 (same criterion)
df330 <- head(df3, n = 30)
injury_events <- df330$etype
injury_events
##  [1] "TORNADO"                                      
##  [2] "THUNDERSTORM WIND/HAIL"                       
##  [3] "HEAT"                                         
##  [4] "FLOOD"                                        
##  [5] "LIGHTNING"                                    
##  [6] "WINTER STORM"                                 
##  [7] "WILDFIRE"                                     
##  [8] "HIGH WIND"                                    
##  [9] "HURRICANE"                                    
## [10] "HEAVY SNOW"                                   
## [11] "FOG"                                          
## [12] "RIP CURRENT"                                  
## [13] "WINTER WEATHER"                               
## [14] "DUST STORM"                                   
## [15] "DENSE FOG"                                    
## [16] "TROPICAL STORM"                               
## [17] "FROST/FREEZE"                                 
## [18] "STRONG WIND"                                  
## [19] "HEAVY RAIN"                                   
## [20] "GLAZE"                                        
## [21] "HIGH SURF"                                    
## [22] "AVALANCHE"                                    
## [23] "WILD FIRES"                                   
## [24] "ICE"                                          
## [25] "TSUNAMI"                                      
## [26] "THUNDERSTORM WIND/HAIL/THUNDERSTORM WIND/HAIL"
## [27] "WIND"                                         
## [28] "WINTRY MIX"                                   
## [29] "Heat Wave"                                    
## [30] "WINTER WEATHER MIX"

Third aggregation

Makes sense to aggregate:

  • “DENSE FOG” to unique: “FOG”
  • “ICE” to unique: “FROST/FREEZE”
  • “HAIL”, “THUNDERSTORM WIND”, “THUNDERSTORM WIND/HAIL/THUNDERSTORM WIND/HAIL” to unique: “THUNDERSTORM WIND/HAIL”
  • “Heat wave” to unique: “HEAT”
#create a new column with aggregated events
#Aggregate similar categories
etype <- gsub(etype, pattern = ("DENSE FOG"), replacement = "FOG")
injury_events <- gsub(injury_events, pattern = ("DENSE FOG"), replacement = "FOG")
etype <- gsub(etype, pattern = ("ICE"), replacement = "FROST/FREEZE")
injury_events <- gsub(injury_events, pattern = ("ICE"), replacement = "FROST/FREEZE")
etype <- gsub(etype, pattern = ("HAIL|THUNDERSTORM WIND|THUNDERSTORM WIND/HAIL/THUNDERSTORM WIND/HAIL"), replacement = "THUNDERSTORM WIND/HAIL")
injury_events <- gsub(injury_events, pattern = ("HAIL|THUNDERSTORM WIND|THUNDERSTORM WIND/HAIL/THUNDERSTORM WIND/HAIL"), replacement = "THUNDERSTORM WIND/HAIL")
etype <- gsub(etype, pattern = ("Heat Wave"), replacement = "HEAT")
injury_events <- gsub(injury_events, pattern = ("Heat Wave"), replacement = "HEAT")

Updates

# update column
storm$etype <- etype # in original data
#
# final list
injury_events <- unique(injury_events)

Create tidy data set

Create data where each row is a event and we also have a variable with its importance.

Important note

For now, we have identified some important levels and aggregated some of them to unified categories. However, categories with small frequencies will remain that, when aggregated, could reach levels equivalent to some of the categories selected as important categories.

In order to avoid these type of bias the categories should be cleaned further, for example, by applying some stemming algorithm from text mining. We don’t do this for the assignment.

This is the selected list of events. .

#identified categories for the moment
all_events <- unique(c(frequent_events,fatal_events, injury_events))
all_events
##  [1] "THUNDERSTORM WIND/HAIL"                                   
##  [2] "FLOOD"                                                    
##  [3] "TORNADO"                                                  
##  [4] "HIGH WIND"                                                
##  [5] "WINTER STORM"                                             
##  [6] "LIGHTNING"                                                
##  [7] "HEAVY SNOW"                                               
##  [8] "HEAVY RAIN"                                               
##  [9] "WINTER WEATHER"                                           
## [10] "FUNNEL CLOUD"                                             
## [11] "WILDFIRE"                                                 
## [12] "WATERSPOUT"                                               
## [13] "STRONG WIND"                                              
## [14] "DROUGHT"                                                  
## [15] "FROST/FREEZE"                                             
## [16] "HEAT"                                                     
## [17] "DENSE FOG"                                                
## [18] "THUNDERSTORM WIND/HAIL/THUNDERSTORM WIND/HAIL"            
## [19] "HIGH SURF"                                                
## [20] "TROPICAL STORM"                                           
## [21] "LAKE-EFFECT SNOW"                                         
## [22] "LANDSLIDE"                                                
## [23] "SNOW"                                                     
## [24] "COLD/WIND CHILL"                                          
## [25] "RIP CURRENT"                                              
## [26] "AVALANCHE"                                                
## [27] "HURRICANE"                                                
## [28] "FOG"                                                      
## [29] "HEAVY SURF/HIGH SURF"                                     
## [30] "COLD"                                                     
## [31] "TSUNAMI"                                                  
## [32] "UNSEASONABLY WARM AND DRY"                                
## [33] "TORNADOES, THUNDERSTORM WIND/HAIL, THUNDERSTORM WIND/HAIL"
## [34] "DUST STORM"                                               
## [35] "GLAZE"                                                    
## [36] "WILD FIRES"                                               
## [37] "WIND"                                                     
## [38] "WINTRY MIX"                                               
## [39] "WINTER WEATHER MIX"

Last aggregation

  • “DENSE FOG” to “FOG”
etype <- gsub(etype, pattern = ("DENSE FOG"), replacement = "FOG")
all_events <- gsub(all_events, pattern = ("DENSE FOG"), replacement = "FOG")
all_events <- unique(all_events)
all_events
##  [1] "THUNDERSTORM WIND/HAIL"                                   
##  [2] "FLOOD"                                                    
##  [3] "TORNADO"                                                  
##  [4] "HIGH WIND"                                                
##  [5] "WINTER STORM"                                             
##  [6] "LIGHTNING"                                                
##  [7] "HEAVY SNOW"                                               
##  [8] "HEAVY RAIN"                                               
##  [9] "WINTER WEATHER"                                           
## [10] "FUNNEL CLOUD"                                             
## [11] "WILDFIRE"                                                 
## [12] "WATERSPOUT"                                               
## [13] "STRONG WIND"                                              
## [14] "DROUGHT"                                                  
## [15] "FROST/FREEZE"                                             
## [16] "HEAT"                                                     
## [17] "FOG"                                                      
## [18] "THUNDERSTORM WIND/HAIL/THUNDERSTORM WIND/HAIL"            
## [19] "HIGH SURF"                                                
## [20] "TROPICAL STORM"                                           
## [21] "LAKE-EFFECT SNOW"                                         
## [22] "LANDSLIDE"                                                
## [23] "SNOW"                                                     
## [24] "COLD/WIND CHILL"                                          
## [25] "RIP CURRENT"                                              
## [26] "AVALANCHE"                                                
## [27] "HURRICANE"                                                
## [28] "HEAVY SURF/HIGH SURF"                                     
## [29] "COLD"                                                     
## [30] "TSUNAMI"                                                  
## [31] "UNSEASONABLY WARM AND DRY"                                
## [32] "TORNADOES, THUNDERSTORM WIND/HAIL, THUNDERSTORM WIND/HAIL"
## [33] "DUST STORM"                                               
## [34] "GLAZE"                                                    
## [35] "WILD FIRES"                                               
## [36] "WIND"                                                     
## [37] "WINTRY MIX"                                               
## [38] "WINTER WEATHER MIX"

Now we compute the tidy data set.

data_final <-  as.data.frame(table(storm$etype))
names(data_final) <- c("etype", "freq")
df_selected <- data.frame(etype = all_events, selected = 1)
data_final <- merge(data_final, df_selected, by="etype", all.x = TRUE, all.y=TRUE)

Finally merge tables of fatalities and injuries.

names(data_final)
## [1] "etype"    "freq"     "selected"
data_final <- merge(data_final, df2, by="etype", all.x = TRUE)
data_final <- merge(data_final, df3, by="etype", all.x = TRUE)
data_final <- data_final[!is.na(data_final$selected),]

Results

This section aims to give a satisfactory answer to the 2 questions addressed in the assignment instructions.

Question 1: events harmful for health

With the tidy data set, in this sub-section we will try to give an answer to this question:

Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?

library(ggplot2)
data <- data_final %>% arrange(desc(freq))
ggplot(data, aes(x=etype, y = freq)) + geom_bar(position="dodge",stat="identity") +
        coord_flip() + geom_point(aes(etype, sum_fatalities), colour="red")

plot of chunk unnamed-chunk-7

Some levels have to be aggregated still, but overall we see results. Tornado is the most fatal event. “Thunderstorm” is so frequent that masks everything else. Now we remove it and repeat the plot.

ggplot(subset(data, etype != "THUNDERSTORM WIND/HAIL/THUNDERSTORM WIND/HAIL"), aes(x=etype, y = freq)) + geom_bar(position="dodge",stat="identity") + coord_flip() + geom_point(aes(etype, sum_fatalities), colour="red") 

Tornadoes show more total fatalities than floods even if they are less frequent. A normalization would be necessary in order to see fatalities independet of frequency.

Question 2: events with greatest economic impact

In this sub-section we will try to give an answer to this question:

Across the United States, which types of events have the greatest economic consequences?

ggplot(data, aes(x=etype, y = freq)) + geom_bar(position="dodge",stat="identity") +
        coord_flip() + geom_point(aes(etype, sum_injuries), colour="blue") 
## Warning: Removed 3 rows containing missing values (geom_point).

plot of chunk unnamed-chunk-9

The conclusions are similar, the wind seems to be a economic problem.