NOAA Storm Data Analysis

Data Processing

Load first 1000 lines of original data

First we’ll read only the first 1000 lines of the file, to get a quick look at its structure and to decide which columns we actually need for further analysis.

df <- read.csv("./data/repdata_data_StormData.csv.bz2", nrows = 1000)
str(df)

## 'data.frame':    1000 obs. of  37 variables:
##  $ STATE__   : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ BGN_DATE  : Factor w/ 449 levels "1/10/1972 0:00:00",..: 182 182 86 344 35 35 36 9 81 81 ...
##  $ BGN_TIME  : int  130 145 1600 900 1500 2000 100 900 2000 2000 ...
##  $ TIME_ZONE : Factor w/ 2 levels "CST","EST": 1 1 1 1 1 1 1 1 1 1 ...
##  $ COUNTY    : num  97 3 57 89 43 77 9 123 125 57 ...
##  $ COUNTYNAME: Factor w/ 68 levels "","AUTAUGA","BALDWIN",..: 50 3 30 46 23 40 6 63 64 30 ...
##  $ STATE     : Factor w/ 1 level "AL": 1 1 1 1 1 1 1 1 1 1 ...
##  $ EVTYPE    : Factor w/ 3 levels "HAIL","TORNADO",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ BGN_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ BGN_AZI   : logi  NA NA NA NA NA NA ...
##  $ BGN_LOCATI: logi  NA NA NA NA NA NA ...
##  $ END_DATE  : logi  NA NA NA NA NA NA ...
##  $ END_TIME  : logi  NA NA NA NA NA NA ...
##  $ COUNTY_END: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ COUNTYENDN: logi  NA NA NA NA NA NA ...
##  $ END_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ END_AZI   : logi  NA NA NA NA NA NA ...
##  $ END_LOCATI: logi  NA NA NA NA NA NA ...
##  $ LENGTH    : num  14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
##  $ WIDTH     : num  100 150 123 100 150 177 33 33 100 100 ...
##  $ F         : int  3 2 2 2 2 2 2 1 3 3 ...
##  $ MAG       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ FATALITIES: num  0 0 0 0 0 0 0 0 1 0 ...
##  $ INJURIES  : num  15 0 2 2 2 6 1 0 14 0 ...
##  $ PROPDMG   : num  25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
##  $ PROPDMGEXP: Factor w/ 3 levels "","K","M": 2 2 2 2 2 2 2 2 2 2 ...
##  $ CROPDMG   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CROPDMGEXP: logi  NA NA NA NA NA NA ...
##  $ WFO       : logi  NA NA NA NA NA NA ...
##  $ STATEOFFIC: logi  NA NA NA NA NA NA ...
##  $ ZONENAMES : logi  NA NA NA NA NA NA ...
##  $ LATITUDE  : num  3040 3042 3340 3458 3412 ...
##  $ LONGITUDE : num  8812 8755 8742 8626 8642 ...
##  $ LATITUDE_E: num  3051 0 0 0 0 ...
##  $ LONGITUDE_: num  8806 0 0 0 0 ...
##  $ REMARKS   : logi  NA NA NA NA NA NA ...
##  $ REFNUM    : num  1 2 3 4 5 6 7 8 9 10 ...

Import only necessary variables

To answer the questions raised in the assignment we need only a few columns. Hence, we’ll import only the columns needed for further analysis:
EVTYPE, FATALITIES, INJURIES, PROPDMG, PROPDMGEXP, CROPDMG and CROPDMGEXP.

cls <- c(rep("NULL", 7), "factor", rep("NULL", 14), 
         "numeric", "numeric", "numeric", "factor", 
         "numeric", "factor", rep("NULL", 9))

df_storm <- read.csv("./data/repdata_data_StormData.csv.bz2", 
                   sep = ",", check.names = FALSE,
                   colClasses = cls)
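
Counting positions by hand for colClasses is easy to get wrong when the file has 37 columns. As a minimal sketch, the same selection can be built from the column names instead. The tiny inline CSV and the names csv_text, keep, cls_by_name and small are made up for illustration; only read.csv and its text, nrows and colClasses arguments come from base R.

```r
# Toy stand-in for the storm file (illustrative data, not from the NOAA set)
csv_text <- "STATE,EVTYPE,FATALITIES,REMARKS
AL,TORNADO,0,none
AL,HAIL,1,none
"

# Read a single row just to learn the column names
peek <- read.csv(text = csv_text, nrows = 1)

# Columns to keep; everything else is skipped with "NULL"
keep <- c("EVTYPE", "FATALITIES")
cls_by_name <- rep("NULL", ncol(peek))
cls_by_name[names(peek) %in% keep] <- NA   # NA lets read.csv guess the type

small <- read.csv(text = csv_text, colClasses = cls_by_name)
str(small)
```

With the real file, the text argument would be replaced by the file path used above.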

Check the data structure

str(df_storm)

## 'data.frame':    902297 obs. of  7 variables:
##  $ EVTYPE    : Factor w/ 985 levels "   HIGH SURF ADVISORY",..: 834 834 834 834 834 834 834 834 834 834 ...
##  $ FATALITIES: num  0 0 0 0 0 0 0 0 1 0 ...
##  $ INJURIES  : num  15 0 2 2 2 6 1 0 14 0 ...
##  $ PROPDMG   : num  25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
##  $ PROPDMGEXP: Factor w/ 19 levels "","-","?","+",..: 17 17 17 17 17 17 17 17 17 17 ...
##  $ CROPDMG   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CROPDMGEXP: Factor w/ 9 levels "","?","0","2",..: 1 1 1 1 1 1 1 1 1 1 ...

From the above we can see that the factor levels are messy and do not match the official event types and multipliers.

For example, EVTYPE has 985 levels, while it should have only 48 (the official event types). Also, the multiplier columns contain punctuation characters and digits, while the only acceptable levels are B, M, K and H.

Clean up the factor levels

Tightening event names

First we will work on the event names. We match the existing names in the EVTYPE column against the official event types and create an ‘events’ column that contains only official event types, each paired with an EVTYPE value by phonetic (Soundex) matching.

# Create vector of official event names taken from the Storm Data preparation document
events_off <- c("ASTRONOMICAL LOW TIDE","AVALANCHE","BLIZZARD",
            "COASTAL FLOOD","COLD/WIND CHILL","DEBRIS FLOW",
            "DENSE FOG","DENSE SMOKE","DROUGHT","DUST DEVIL",
            "DUST STORM","EXCESSIVE HEAT","EXTREME COLD/WIND CHILL",
            "FLASH FLOOD","FLOOD","FREEZING FOG","FROST/FREEZE","FUNNEL CLOUD",
            "HAIL","HEAT","HEAVY RAIN","HEAVY SNOW","HIGH SURF","HIGH WIND",
            "HURRICANE/TYPHOON","ICE STORM","LAKESHORE FLOOD","LAKE-EFFECT SNOW",
            "LIGHTNING","MARINE HAIL","MARINE HIGH WIND","MARINE STRONG WIND",
            "MARINE THUNDERSTORM WIND","RIP CURRENT","SEICHE","SLEET",
            "STORM TIDE","STRONG WIND","THUNDERSTORM WIND","TORNADO",
            "TROPICAL DEPRESSION","TROPICAL STORM","TSUNAMI","VOLCANIC ASH",
            "WATERSPOUT","WILDFIRE","WINTER STORM","WINTER WEATHER")

# Change the class of EVTYPE variable
df_storm$EVTYPE <- as.character(df_storm$EVTYPE)

# Create new variable 'events' by phonetically (Soundex) matching EVTYPE names
# against the official event names; similar-sounding strings get equal codes.
# Note: the Soundex distance is 0 for equal codes and 1 otherwise, so
# maxDist = 2 never rejects a candidate.
library(dplyr)
library(stringdist)
df_storm <- df_storm %>% mutate(events = 
                events_off[amatch(EVTYPE, 
                events_off, method = "soundex", maxDist = 2)])
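
To see what the Soundex matching does, here is a small sketch on made-up inputs. It assumes the stringdist package is installed; the vectors official and raw and the comparison of maxDist = 0 with maxDist = 2 are illustrative and not part of the original analysis. Because the Soundex distance is either 0 (same code) or 1 (different code), maxDist = 0 leaves values without a phonetic match as NA, while maxDist = 2 assigns them to some official name anyway.

```r
library(stringdist)

official <- c("AVALANCHE", "TORNADO", "HAIL")   # toy subset of official names
raw      <- c("TORNADOES", "AVALANCE", "WIND")  # toy raw EVTYPE values

# Soundex codes: strings with the same code count as a match
phonetic(c(official, raw), method = "soundex")

# Strict: only identical Soundex codes are matched
strict <- official[amatch(raw, official, method = "soundex", maxDist = 0)]

# Loose: a distance of 1 (different codes) is also accepted
loose <- official[amatch(raw, official, method = "soundex", maxDist = 2)]

data.frame(raw, strict, loose)
```

Comparing the strict and loose columns on the real data is one way to check how many raw names have no phonetic counterpart among the official event types.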

Check the multiplier levels

levels(df_storm$PROPDMGEXP)

##  [1] ""  "-" "?" "+" "0" "1" "2" "3" "4" "5" "6" "7" "8" "B" "h" "H" "K"
## [18] "m" "M"

levels(df_storm$CROPDMGEXP)

## [1] ""  "?" "0" "2" "B" "k" "K" "m" "M"

Tightening property and crop damage multipliers

We will remove digits and punctuation characters and convert the remaining levels to uppercase.

levels(df_storm$PROPDMGEXP) <- gsub(pattern = "\\d|[[:punct:]]", 
                replacement = "", x = levels(df_storm$PROPDMGEXP))
levels(df_storm$PROPDMGEXP) <- toupper(levels(df_storm$PROPDMGEXP))
levels(df_storm$CROPDMGEXP) <- gsub(pattern = "\\d|[[:punct:]]", 
                replacement = "", x = levels(df_storm$CROPDMGEXP))
levels(df_storm$CROPDMGEXP) <- toupper(levels(df_storm$CROPDMGEXP))

Observe how the multiplier levels look after cleaning

levels(df_storm$PROPDMGEXP)

## [1] ""  "B" "H" "K" "M"

levels(df_storm$CROPDMGEXP)

## [1] ""  "B" "K" "M"
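
The cleaning works because assigning duplicate names to the levels of a factor merges those levels. A minimal sketch on a made-up factor (the name f and its values are illustrative):

```r
# Toy multiplier column with mixed case, a digit, punctuation and a blank
f <- factor(c("K", "k", "m", "M", "5", "?", ""))

# Strip digits and punctuation from the level names, then uppercase them
cleaned <- toupper(gsub("\\d|[[:punct:]]", "", levels(f)))

# Duplicate level names are merged when assigned back
levels(f) <- cleaned

levels(f)
table(f)
```

Values such as "5" or "?" end up in the empty level, so they are later treated as having no multiplier.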

Replace each multiplier symbol with its power of ten and calculate the total (property plus crop) damage

df_storm <- mutate(df_storm, 
               prop_exp = ifelse(PROPDMGEXP == 'B', 9,
                          ifelse(PROPDMGEXP == 'M', 6,
                          ifelse(PROPDMGEXP == 'K', 3,
                          ifelse(PROPDMGEXP == 'H', 2, 0) ) ) ),
               crop_exp = ifelse(CROPDMGEXP == 'B', 9,
                          ifelse(CROPDMGEXP == 'M', 6,
                          ifelse(CROPDMGEXP == 'K', 3,
                          ifelse(CROPDMGEXP == 'H', 2, 0) ) ) ),
               
               total = PROPDMG * (10^prop_exp) + CROPDMG * (10^crop_exp)
)
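
For instance, a PROPDMG of 25 with multiplier K becomes 25 * 10^3 = 25000. The nested ifelse() calls can also be written as a named lookup vector. The sketch below is an alternative formulation, not the code used in this report, and the names exp_lookup and damage_value are hypothetical.

```r
# Power of ten for each multiplier symbol; anything else maps to 0
exp_lookup <- c(H = 2, K = 3, M = 6, B = 9)

damage_value <- function(amount, symbol) {
  power <- unname(exp_lookup[as.character(symbol)])
  power[is.na(power)] <- 0          # blank or unknown symbol: 10^0 = 1
  amount * 10^power
}

damage_value(c(25, 2.5, 1, 3), c("K", "M", "B", ""))
```

The call evaluates to 25000, 2500000, 1000000000 and 3, the last value being unchanged because a blank multiplier means a power of zero.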

Results

Create separate tables for fatalities and injuries

fatal <- df_storm %>% group_by(events) %>% 
        summarise(sum_fatal = sum(FATALITIES)) %>%
        arrange(desc(sum_fatal))
injur <- df_storm %>% group_by(events) %>% 
        summarise(sum_inj = sum(INJURIES)) %>%
        arrange(desc(sum_inj))
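
The group_by / summarise / arrange pattern above sums a column per event and sorts the result in descending order. A base R sketch of the same idea on made-up numbers (the toy data frame and its values are illustrative only, not taken from the storm data):

```r
toy <- data.frame(
  events     = c("TORNADO", "FLOOD", "TORNADO", "HAIL", "FLOOD"),
  FATALITIES = c(3, 1, 2, 0, 2)
)

# Sum fatalities per event, then sort from most to least harmful
sums <- aggregate(FATALITIES ~ events, data = toy, FUN = sum)
sums <- sums[order(-sums$FATALITIES), ]

head(sums, 2)   # the two most harmful events in the toy data
```

Calling head() with n = 10 on fatal or injur would give the ten events referred to in the graphical presentation.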

Graphical presentation of the ten events that are most harmful to population health.

From the results for the population-harm question, we can see that the event type causing the most population harm is TORNADO.

Create table of total economic damage (property plus crop)

total_tbl <- df_storm %>% group_by(events) %>% 
        summarise(sum_total = sum(total)) %>%
        arrange(desc(sum_total))
head(total_tbl)

## Source: local data frame [6 x 2]
## 
##                  events    sum_total
##                   (chr)        (dbl)
## 1                 FLOOD 150319678257
## 2     HURRICANE/TYPHOON  90271472810
## 3               TORNADO  58959393949
## 4            STORM TIDE  48213406740
## 5 ASTRONOMICAL LOW TIDE  19738579082
## 6                  HAIL  18758864016

Graphical presentation of the ten events that cause the most economic harm.

From the results for the economic-harm question, we can see that the event type causing the most economic harm is FLOOD.