Analysis of Storm Data for the United States

This analysis answers two questions about weather events in the United States. The process consisted of cleaning and filtering the data, performing some calculations, and plotting the results.

The questions are:

Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?
Across the United States, which types of events have the greatest economic consequences?

Data Processing

First, we will talk about how the data was processed to reach the answers.

As a first step, we download the dataset and read it into R. We also load the dplyr package, which we will need later:

library(dplyr)

## 
## Attaching package: 'dplyr'
## 
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## 
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

download.file("http://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2", "repdata_data_StormData.csv.bz2")
storm_data <- read.csv(bzfile("repdata_data_StormData.csv.bz2"))
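Downloading the archive on every run is slow. A minimal sketch of one way to avoid that, assuming a hypothetical helper named `fetch_once` (not part of the original analysis) that skips the download when the file already exists:

```r
# Hypothetical helper: download `url` to `dest` only if `dest` is not there yet.
# `fetch` defaults to download.file; it is a parameter so it can be swapped out.
fetch_once <- function(url, dest, fetch = download.file) {
  if (!file.exists(dest)) {
    fetch(url, dest, mode = "wb")  # binary mode keeps the .bz2 archive intact
  }
  invisible(dest)
}

# Possible usage with the file name from above:
# fetch_once("http://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2",
#            "repdata_data_StormData.csv.bz2")
```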

Next, we will clean the EVTYPE column, which is vital to our analysis, as you will see later.

# Cleaning the EVTYPE
new_storm_data <- filter(storm_data, !grepl('Summary', EVTYPE))
new_storm_data <- filter(new_storm_data, !grepl('SUMMARY', EVTYPE))
new_storm_data <- new_storm_data[new_storm_data$EVTYPE != "?",]
new_storm_data[grepl("TSTM WIND", new_storm_data$EVTYPE), "EVTYPE"] <- "TSTM WIND"
new_storm_data[grepl("THUNDERSTORM WIND", new_storm_data$EVTYPE), "EVTYPE"] <- "TSTM WIND"
new_storm_data[grepl("TORNADO", new_storm_data$EVTYPE), "EVTYPE"] <- "TORNADO"
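To make the effect of these steps easier to see, here is a small sketch on a made-up character vector (the values are illustrative, not taken from the dataset). It uses `ignore.case = TRUE` as a shortcut for the two separate "Summary"/"SUMMARY" filters, which is slightly broader than the original.

```r
# Toy event types, made up for illustration
ev <- c("TSTM WIND/HAIL", "THUNDERSTORM WINDS", "TORNADO F1",
        "Summary of May 5", "?", "FLOOD")

# Drop summary rows and the "?" placeholder
keep <- !grepl("summary", ev, ignore.case = TRUE) & ev != "?"
ev <- ev[keep]

# Collapse spelling variants into a single label
ev[grepl("TSTM WIND|THUNDERSTORM WIND", ev)] <- "TSTM WIND"
ev[grepl("TORNADO", ev)] <- "TORNADO"

print(ev)
```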

We will also extract the date and filter the data frame on it. Before 1996, data was recorded only for Tornado, Thunderstorm Wind and Hail events. To make a fair comparison, we will use only the years in which all event types were recorded.

new_storm_data$Date <- strptime(new_storm_data$BGN_DATE, format = "%m/%d/%Y")
new_storm_data <- new_storm_data[new_storm_data$Date$year >= 96,]
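The comparison against 96 works because `strptime` returns a POSIXlt object whose `year` component counts years since 1900. A short sketch with two illustrative dates:

```r
# Example dates in the same "%m/%d/%Y" layout (values are illustrative)
d <- strptime(c("12/31/1995", "1/1/1996"), format = "%m/%d/%Y")

calendar_year <- d$year + 1900   # POSIXlt stores years since 1900
keep <- d$year >= 96             # TRUE only for dates in 1996 or later

print(calendar_year)
print(keep)
```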

We will also calculate the damage value. According to the directive, B corresponds to billions, M to millions and K to thousands. Any other value in the PROPDMGEXP and CROPDMGEXP columns will be dropped. First we will make the calculations:

if (new_storm_data$PROPDMGEXP == "b" | new_storm_data$PROPDMGEXP == "B") {
  new_storm_data$DMGVALUE <- new_storm_data$PROPDMG * 1000000000
} else if (new_storm_data$PROPDMGEXP == "m" | new_storm_data$PROPDMGEXP == "M") {
  new_storm_data$DMGVALUE <- new_storm_data$PROPDMG * 1000000
} else if (new_storm_data$PROPDMGEXP == "k" | new_storm_data$PROPDMGEXP == "K") {
  new_storm_data$DMGVALUE <- new_storm_data$PROPDMG * 1000
}

## Warning in if (new_storm_data$PROPDMGEXP == "b" | new_storm_data$PROPDMGEXP
## == : the condition has length > 1 and only the first element will be
## used

## Warning in if (new_storm_data$PROPDMGEXP == "m" | new_storm_data$PROPDMGEXP
## == : the condition has length > 1 and only the first element will be
## used

## Warning in if (new_storm_data$PROPDMGEXP == "k" | new_storm_data$PROPDMGEXP
## == : the condition has length > 1 and only the first element will be
## used

if (new_storm_data$CROPDMGEXP == "b" | new_storm_data$CROPDMGEXP == "B") {
  new_storm_data$DMGVALUE <- new_storm_data$DMGVALUE + new_storm_data$CROPDMG * 1000000000
} else if (new_storm_data$CROPDMGEXP == "m" | new_storm_data$CROPDMGEXP == "M") {
  new_storm_data$DMGVALUE <- new_storm_data$DMGVALUE + new_storm_data$CROPDMG * 1000000
} else if (new_storm_data$CROPDMGEXP == "k" | new_storm_data$CROPDMGEXP == "K") {
  new_storm_data$DMGVALUE <- new_storm_data$DMGVALUE + new_storm_data$CROPDMG * 1000
}

## Warning in if (new_storm_data$CROPDMGEXP == "b" | new_storm_data$CROPDMGEXP
## == : the condition has length > 1 and only the first element will be
## used

## Warning in if (new_storm_data$CROPDMGEXP == "m" | new_storm_data$CROPDMGEXP
## == : the condition has length > 1 and only the first element will be
## used

## Warning in if (new_storm_data$CROPDMGEXP == "k" | new_storm_data$CROPDMGEXP
## == : the condition has length > 1 and only the first element will be
## used
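The warnings above report that `if` used only the first element of each condition, so every row was given the multiplier that matched the first row. A vectorized alternative might look like the sketch below. The helper name `exp_multiplier` and the toy rows are hypothetical, and treating unrecognized codes as contributing zero is an assumption about what "dropped" should mean. The figures reported under Results come from the code as originally run, not from this sketch.

```r
# Hypothetical helper: map an exponent code to a numeric multiplier.
# Codes other than B, M, K (in any case) yield NA.
exp_multiplier <- function(code) {
  lookup <- c(K = 1e3, M = 1e6, B = 1e9)
  unname(lookup[toupper(as.character(code))])
}

# Toy rows (values are made up)
toy <- data.frame(PROPDMG    = c(2.5, 1, 3, 4),
                  PROPDMGEXP = c("K", "m", "B", "+"),
                  CROPDMG    = c(1, 0, 2, 0),
                  CROPDMGEXP = c("K", "", "M", "K"))

prop <- toy$PROPDMG * exp_multiplier(toy$PROPDMGEXP)
crop <- toy$CROPDMG * exp_multiplier(toy$CROPDMGEXP)

# Assumption: an unrecognized code contributes zero to the total
toy$DMGVALUE <- ifelse(is.na(prop), 0, prop) + ifelse(is.na(crop), 0, crop)
print(toy$DMGVALUE)
```

Because the lookup is indexed by the whole column at once, each row gets its own multiplier and no `if` branch is needed.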

Then we will create a new data frame containing only the rows whose calculated damage value is greater than 0. It will be used to show the results on damage value. First, though, we need to convert the Date column to POSIXct, since dplyr's filter does not work with POSIXlt columns.

new_storm_data$Date <- as.POSIXct(new_storm_data$Date)
storm_data_with_value <- filter(new_storm_data, new_storm_data$DMGVALUE != 0)

Now we will calculate the total number of fatalities, the total number of injuries and the total damage value for each type of event (EVTYPE).

injuries_per_type <- tapply(new_storm_data$INJURIES, new_storm_data$EVTYPE, sum)
fatalities_per_type <- tapply(new_storm_data$FATALITIES, new_storm_data$EVTYPE, sum)
values_per_type <- tapply(storm_data_with_value$DMGVALUE, storm_data_with_value$EVTYPE, sum)
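The same per-type totals can also be produced in a single call. The sketch below uses base R's `aggregate` on a made-up data frame (the rows are illustrative only) and is an alternative to, not a description of, the `tapply` calls above.

```r
# Toy data, made up for illustration
toy <- data.frame(EVTYPE     = c("TORNADO", "FLOOD", "TORNADO", "HAIL"),
                  INJURIES   = c(3, 1, 2, 0),
                  FATALITIES = c(1, 0, 0, 0))

# Sum both measures per event type in one step
totals <- aggregate(cbind(INJURIES, FATALITIES) ~ EVTYPE, data = toy, FUN = sum)
print(totals)
```

The result is a data frame with one row per event type, which can be convenient when several measures need to be compared side by side.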

Results

Now we will check the results, starting with the data on injuries and fatalities. We will select the 15 event types that caused the most injuries and the 15 that caused the most fatalities. We will also see what share of all cases they cover compared with the remaining event types.

worst_types_injuries <- sort(injuries_per_type, decreasing=TRUE)
worst_types_injuries <- worst_types_injuries[1:15]

worst_types_fatalities <- sort(fatalities_per_type, decreasing=TRUE)
worst_types_fatalities <- worst_types_fatalities[1:15]

sum(worst_types_fatalities) / sum(new_storm_data$FATALITIES)

## [1] 0.8508933

sum(worst_types_injuries) / sum(new_storm_data$INJURIES)

## [1] 0.9184131
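Both ratios above follow the same pattern: sort the totals, keep the largest n, and divide by the grand total. A minimal sketch of that pattern as a reusable function, where the name `top_share` and the toy totals are hypothetical:

```r
# Hypothetical helper: fraction of the grand total covered by the n largest entries.
top_share <- function(totals, n = 15) {
  totals <- sort(totals[!is.na(totals)], decreasing = TRUE)
  sum(head(totals, n)) / sum(totals)
}

# Toy totals, made up for illustration
toy_totals <- c(A = 50, B = 30, C = 15, D = 5)
share <- top_share(toy_totals, n = 2)   # (50 + 30) / 100
print(share)
```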

Since the top 15 event types cover at least 85% of the cases, we conclude that further cleaning of the EVTYPE column would cost considerable time and effort for an improvement that would not be significant. Let us see which events are the worst and plot the data. First the injuries:

print(worst_types_injuries)

##           TORNADO             FLOOD    EXCESSIVE HEAT         TSTM WIND 
##             20667              6758              6391              5163 
##         LIGHTNING       FLASH FLOOD      WINTER STORM HURRICANE/TYPHOON 
##              4141              1674              1292              1275 
##              HEAT         HIGH WIND          WILDFIRE              HAIL 
##              1222              1083               911               713 
##               FOG        HEAVY SNOW  WILD/FOREST FIRE 
##               712               698               545

barplot(worst_types_injuries, las=2)

And then the fatalities:

print(worst_types_fatalities)

##          EXCESSIVE HEAT                 TORNADO             FLASH FLOOD 
##                    1797                    1511                     887 
##               LIGHTNING                   FLOOD               TSTM WIND 
##                     651                     414                     397 
##             RIP CURRENT                    HEAT               HIGH WIND 
##                     340                     237                     235 
##               AVALANCHE            RIP CURRENTS            WINTER STORM 
##                     223                     202                     191 
## EXTREME COLD/WIND CHILL            EXTREME COLD              HEAVY SNOW 
##                     125                     113                     107

barplot(worst_types_fatalities, las=2)

Finally, we will check the results on damage value. We select the 15 event types with the highest damage value and check what share of the total they cover.

worst_types_values <- sort(values_per_type, decreasing=TRUE)
worst_types_values <- worst_types_values[1:15]
sum(worst_types_values) / sum(new_storm_data$DMGVALUE)

## [1] 0.9644712

And then we will see which are the worst events based on damage value and plot them:

print(worst_types_values)

##        TSTM WIND      FLASH FLOOD          TORNADO             HAIL 
##       2384933920       1408629250       1278006730       1073656400 
##            FLOOD        LIGHTNING        HIGH WIND     WINTER STORM 
##        976762890        490465290        332366270        128874480 
##       HEAVY SNOW         WILDFIRE      STRONG WIND        ICE STORM 
##         90984810         87371540         64110710         58202120 
##       HEAVY RAIN   TROPICAL STORM WILD/FOREST FIRE 
##         57980550         53071940         43492970

barplot(worst_types_values, las=2)
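With `las=2` the event-type labels are drawn vertically, and long names can be clipped by the default bottom margin. One possible adjustment, sketched here with three of the fatality totals printed earlier, is to widen the margin with `par(mar = ...)` before plotting. The margin sizes and the axis label are illustrative choices.

```r
# Three fatality totals taken from the output above
toy_totals <- c("EXTREME COLD/WIND CHILL" = 125, "TORNADO" = 1511, "FLASH FLOOD" = 887)

old_par <- par(mar = c(12, 5, 2, 1))   # bottom, left, top, right margins, in lines
mids <- barplot(sort(toy_totals, decreasing = TRUE), las = 2,
                cex.names = 0.8, ylab = "Fatalities")
par(old_par)                            # restore the previous margins
```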