Summary

Storms and other severe weather events constantly batter North America, affecting both people and property. They frequently cause public health and economic problems for communities and municipalities, and many of these events result in fatalities, injuries, and severe property damage. Preventing such outcomes to the extent possible is a key concern.

This report involves exploring the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. This database tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage.

According to the analysis, two event types stand out: tornadoes are the most harmful to public health, and floods cause the greatest damage to property. Between 1950 and 2011, floods accounted for around USD 179,908,359,644 in property and crop damage, while the total across all event types rose to USD 47,634,600,000,000,000. Over the same period, tornadoes caused 97,015 public health cases (fatalities plus injuries), out of a total of 155,609 cases from all causes.

Data Processing

The data for this assignment come in the form of a comma-separated-value file compressed with the bzip2 algorithm to reduce its size. The code to download the file is as follows:

download.file(
url="https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2",
destfile= "StormData.csv.bz2", quiet = TRUE)

The download.file command is included in the “utils” library (other necessary libraries are listed in the appendix). It requires the “url” and the destination file name “destfile” (no path is required if the working directory has been set properly beforehand). The additional parameter “quiet” is set to TRUE to suppress status messages and the progress bar. Downloading the file (about 48MB) will take a couple of minutes, depending on your internet connection.
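Because the download takes a while, it can be worth skipping it when the file is already present. The following is a minimal sketch of that idea; the helper name download_if_missing and its fetch argument are hypothetical and not part of the original analysis.

```r
# Hypothetical helper (not from the original report): download a file only
# when it is not already present in the working directory.
# `fetch` defaults to utils::download.file and can be swapped out for testing.
download_if_missing <- function(url, destfile, fetch = utils::download.file) {
  if (file.exists(destfile)) {
    return(invisible(FALSE))   # nothing to do, file is already there
  }
  fetch(url = url, destfile = destfile, quiet = TRUE)
  invisible(TRUE)              # a download was attempted
}

# Usage with the URL from this report (left commented out so that running
# this block does not trigger a 48MB download):
# download_if_missing(
#   "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2",
#   "StormData.csv.bz2")
```

The function returns TRUE when it called the downloader and FALSE when the file already existed.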

First, the data set has to be uncompressed. The read.csv command handles this directly, but for now let’s read just the first 6 rows in order to inspect the variables.

rawStormData <- read.csv("StormData.csv.bz2", header = TRUE, sep = ",", nrows = 6)
str(rawStormData)
## 'data.frame':    6 obs. of  37 variables:
##  $ STATE__   : num  1 1 1 1 1 1
##  $ BGN_DATE  : Factor w/ 4 levels "11/15/1951 0:00:00",..: 3 3 2 4 1 1
##  $ BGN_TIME  : int  130 145 1600 900 1500 2000
##  $ TIME_ZONE : Factor w/ 1 level "CST": 1 1 1 1 1 1
##  $ COUNTY    : num  97 3 57 89 43 77
##  $ COUNTYNAME: Factor w/ 6 levels "BALDWIN","CULLMAN",..: 6 1 3 5 2 4
##  $ STATE     : Factor w/ 1 level "AL": 1 1 1 1 1 1
##  $ EVTYPE    : Factor w/ 1 level "TORNADO": 1 1 1 1 1 1
##  $ BGN_RANGE : num  0 0 0 0 0 0
##  $ BGN_AZI   : logi  NA NA NA NA NA NA
##  $ BGN_LOCATI: logi  NA NA NA NA NA NA
##  $ END_DATE  : logi  NA NA NA NA NA NA
##  $ END_TIME  : logi  NA NA NA NA NA NA
##  $ COUNTY_END: num  0 0 0 0 0 0
##  $ COUNTYENDN: logi  NA NA NA NA NA NA
##  $ END_RANGE : num  0 0 0 0 0 0
##  $ END_AZI   : logi  NA NA NA NA NA NA
##  $ END_LOCATI: logi  NA NA NA NA NA NA
##  $ LENGTH    : num  14 2 0.1 0 0 1.5
##  $ WIDTH     : num  100 150 123 100 150 177
##  $ F         : int  3 2 2 2 2 2
##  $ MAG       : num  0 0 0 0 0 0
##  $ FATALITIES: num  0 0 0 0 0 0
##  $ INJURIES  : num  15 0 2 2 2 6
##  $ PROPDMG   : num  25 2.5 25 2.5 2.5 2.5
##  $ PROPDMGEXP: Factor w/ 1 level "K": 1 1 1 1 1 1
##  $ CROPDMG   : num  0 0 0 0 0 0
##  $ CROPDMGEXP: logi  NA NA NA NA NA NA
##  $ WFO       : logi  NA NA NA NA NA NA
##  $ STATEOFFIC: logi  NA NA NA NA NA NA
##  $ ZONENAMES : logi  NA NA NA NA NA NA
##  $ LATITUDE  : num  3040 3042 3340 3458 3412 ...
##  $ LONGITUDE : num  8812 8755 8742 8626 8642 ...
##  $ LATITUDE_E: num  3051 0 0 0 0 ...
##  $ LONGITUDE_: num  8806 0 0 0 0 ...
##  $ REMARKS   : logi  NA NA NA NA NA NA
##  $ REFNUM    : num  1 2 3 4 5 6

Some of these variables look useful, although their names are not very descriptive. The variable EVTYPE stands for event type, and it has 985 distinct values in the full data set. FATALITIES and INJURIES can be used to quantify harm to population health. The variables PROPDMG and CROPDMG stand for property and crop damage respectively. PROPDMGEXP and CROPDMGEXP appear to be related to these two. Let’s see their levels.

Levels for PROPDMGEXP

##  [1] ""  "-" "?" "+" "0" "1" "2" "3" "4" "5" "6" "7" "8" "B" "h" "H" "K"
## [18] "m" "M"

Levels for CROPDMGEXP

## [1] ""  "?" "0" "2" "B" "k" "K" "m" "M"

The variables PROPDMGEXP and CROPDMGEXP should be kept in mind, as they store the scale of each value in PROPDMG and CROPDMG (e.g. millions, billions). This interpretation is based on the National Weather Service documentation, in particular Appendix B.

propDmgs <- select(rawStormData, PROPDMG, PROPDMGEXP)
exptfilt <- filter(propDmgs, as.factor(PROPDMGEXP)
                   %in% c("-","?","+","1","2","3","4","5","6","7","8","9"))
summary(exptfilt)
##     PROPDMG         PROPDMGEXP
##  Min.   : 0.000   5      :28  
##  1st Qu.: 0.000   1      :25  
##  Median : 0.000   2      :13  
##  Mean   : 5.469   ?      : 8  
##  3rd Qu.: 2.200   +      : 5  
##  Max.   :88.000   7      : 5  
##                   (Other):14

There are fewer than 100 records with these levels, and there is no evidence of how exactly these exponents should be applied (either x^i or x*10^i). For this exercise these records are therefore omitted. The same procedure is applied to the CROPDMG and CROPDMGEXP variables.

##     CROPDMG    CROPDMGEXP
##  Min.   :0   ?      :7   
##  1st Qu.:0   2      :1   
##  Median :0          :0   
##  Mean   :0   0      :0   
##  3rd Qu.:0   B      :0   
##  Max.   :0   k      :0   
##              (Other):0
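To see why the numeric exponent levels are ambiguous, consider a short worked example. The values below are illustrative only (a damage value of 2.5 with exponent level "5") and are not taken from a specific record.

```r
# Illustrative values, not a specific record from the data set.
value    <- 2.5
exponent <- 5

as_power <- value^exponent        # reading the level as x^i
as_scale <- value * 10^exponent   # reading the level as x * 10^i

print(as_power)   # 97.65625
print(as_scale)   # 250000
```

The two readings differ by several orders of magnitude, which is why records with these levels are left out rather than guessed.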

Let’s print a summary for the other levels of PROPDMGEXP.

propDmgs <- select(rawStormData, PROPDMG, PROPDMGEXP)
exptfilt <- filter(propDmgs, as.factor(PROPDMGEXP)
                   %in% c("B","h","H","K","m","M"))
summary(exptfilt)
##     PROPDMG          PROPDMGEXP    
##  Min.   :   0.00   K      :424665  
##  1st Qu.:   0.00   M      : 11330  
##  Median :   1.00   B      :    40  
##  Mean   :  24.94   m      :     7  
##  3rd Qu.:  10.00   H      :     6  
##  Max.   :5000.00   h      :     1  
##                    (Other):     0

The levels “B”, “h”/“H”, “K” and “m”/“M” are taken to stand for billion, hecto, kilo and million respectively, so each value is going to be multiplied by the corresponding factor: H = 100, K = 1000, M = 1000000, B = 1000000000. Values with no exponent character are simply multiplied by one. The resulting data set is as follows:

EVTYPE   FATALITIES  INJURIES  HEALTHDMG  ECONOMICC
TORNADO  0           15        15         25000
TORNADO  0           0         0          2500
TORNADO  0           2         2          25000
TORNADO  0           2         2          2500
TORNADO  0           2         2          2500
TORNADO  0           6         6          2500

Here HEALTHDMG and ECONOMICC stand for health damages (fatalities plus injuries) and economic consequences (USD) respectively.
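The scaling and the two derived columns can be sketched compactly with a lookup vector. This is a base R illustration on made-up rows, assuming the multipliers stated above; the helper name exp_multiplier is hypothetical, and the code actually used for this report is listed in the appendix.

```r
# Hypothetical helper: translate an exponent code into its multiplier.
# Codes outside H/K/M/B (and blank) yield NA, so they are easy to drop.
exp_multiplier <- function(code) {
  code   <- as.character(code)
  lookup <- c(H = 1e2, K = 1e3, M = 1e6, B = 1e9)
  mult   <- unname(lookup[toupper(code)])
  mult[code %in% ""] <- 1   # blank exponent: value is taken as-is
  mult
}

# Made-up rows, for illustration only.
toy <- data.frame(
  FATALITIES = c(0, 1),
  INJURIES   = c(15, 2),
  PROPDMG    = c(25, 2.5),
  PROPDMGEXP = c("K", "M"),
  CROPDMG    = c(0, 1),
  CROPDMGEXP = c("", "k"),
  stringsAsFactors = FALSE
)

toy$HEALTHDMG <- toy$FATALITIES + toy$INJURIES
toy$ECONOMICC <- toy$PROPDMG * exp_multiplier(toy$PROPDMGEXP) +
                 toy$CROPDMG * exp_multiplier(toy$CROPDMGEXP)

print(toy[, c("HEALTHDMG", "ECONOMICC")])
# Row 1: HEALTHDMG 15, ECONOMICC 25000
# Row 2: HEALTHDMG 3,  ECONOMICC 2501000
```

Returning NA for unrecognised codes keeps the omission of ambiguous records explicit instead of silently treating them as a multiplier of one.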

There are around nine hundred event types, and many of them refer to the same kind of event written in different ways. It is necessary to gather them into common groups in order to plot the data properly. This grouping is based on the Storm Data Event Table on page 6 of the STORM DATA PREPARATION document. The event types fall into the following groups:

  1. Flood 2. Fire 3. Heat 4. Tornado 5. Rain 6. Storm 7. Wind
  8. Hail 9. Mud slides 10. Other

The data is then grouped by event type and stored in new variables, ordered for plotting, as follows:

propertyDamageData <- cleanData %>%
                      group_by(EVTYPE) %>%
                      summarise(PROPDMG = sum(ECONOMICC)) %>%
                      arrange(PROPDMG)
propertyDamageData <- transform(propertyDamageData, 
                                EVTYPE = reorder(EVTYPE, PROPDMG))

healthDamageData   <- cleanData %>%
                      group_by(EVTYPE) %>%
                      summarise(HEALTHDMG = sum(HEALTHDMG)) %>%
                      arrange(HEALTHDMG)
healthDamageData <- transform(healthDamageData, 
                                EVTYPE = reorder(EVTYPE, HEALTHDMG))
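To make the effect of the grouping and reordering concrete, here is a small base R sketch on made-up data. It uses aggregate in place of the dplyr pipeline, so it is an equivalent illustration rather than the code used in this report.

```r
# Made-up data, for illustration only.
toy <- data.frame(
  EVTYPE    = c("FLOOD", "TORNADO", "FLOOD", "HAIL"),
  ECONOMICC = c(1000, 500, 2500, 200),
  stringsAsFactors = FALSE
)

# Sum the cost per event type, then sort ascending by total.
totals <- aggregate(ECONOMICC ~ EVTYPE, data = toy, FUN = sum)
totals <- totals[order(totals$ECONOMICC), ]

# reorder() turns EVTYPE into a factor whose levels follow the totals,
# which is what makes the bars of a plot appear in sorted order.
totals$EVTYPE <- reorder(totals$EVTYPE, totals$ECONOMICC)

print(levels(totals$EVTYPE))   # "HAIL" "TORNADO" "FLOOD"
```

Without the reorder step the factor levels would be alphabetical, and the bars would not be sorted by magnitude.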

Results

Between 1950 and 2011, floods accounted for around USD 179,908,359,644 in property and crop damage, while the total across all event types rose to USD 47,634,600,000,000,000.

Over the same period, tornadoes caused 97,015 public health cases (fatalities plus injuries), out of a total of 155,609 cases from all causes.

Appendix

This code is for the initial set up and required libraries:

knitr::opts_chunk$set(echo = FALSE)
library(dplyr)
library(knitr)
library(ggplot2)
library(utils)

The code for data downloading is shown in the Data Processing section.

This code is for the data loading into a variable:

rawStormData <- read.csv("StormData.csv.bz2", header = TRUE, sep = ",")
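No separate decompression step is needed here because R detects the bzip2 compression when it opens the file for reading. The sketch below demonstrates this on a small temporary file with made-up rows; the file and its contents are assumptions for illustration, not the storm data itself.

```r
# Made-up rows written to a temporary bzip2-compressed CSV file.
toy <- data.frame(
  EVTYPE   = c("TORNADO", "FLOOD", "HAIL"),
  INJURIES = c(15, 0, 2),
  stringsAsFactors = FALSE
)
tmp <- tempfile(fileext = ".csv.bz2")
con <- bzfile(tmp, open = "w")
write.csv(toy, con, row.names = FALSE)
close(con)

# read.csv opens the compressed file directly; nrows limits the preview.
preview <- read.csv(tmp, header = TRUE, sep = ",", nrows = 2)
print(preview)

unlink(tmp)   # remove the temporary file
```

The same nrows argument is what keeps the initial inspection of the full storm file fast.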

This code is for property damages exp list:

levels(rawStormData$PROPDMGEXP)

This code is for crop damages exp list:

levels(rawStormData$CROPDMGEXP)

This code is for data cleaning on exp literals:

cleanData <- select(EVTYPE,FATALITIES,INJURIES,PROPDMG,PROPDMGEXP,
                    CROPDMG,CROPDMGEXP,.data=rawStormData)
cleanData <- filter(cleanData, 
                    as.factor(PROPDMGEXP) 
                    %in% c("","B","b","h","H","K","k","M","m"),
                    as.factor(CROPDMGEXP) 
                    %in% c("","B","b","h","H","K","k","M","m"))
cleanData$PROPDE[cleanData$PROPDMGEXP == ""] <- 1
cleanData$PROPDE[cleanData$PROPDMGEXP %in%  c("h","H")] <- 100
cleanData$PROPDE[cleanData$PROPDMGEXP %in%  c("K","k")] <- 1000
cleanData$PROPDE[cleanData$PROPDMGEXP %in%  c("M","m")] <- 1000000
cleanData$PROPDE[cleanData$PROPDMGEXP %in%  c("B","b")] <- 1000000000
cleanData$CROPDE[cleanData$CROPDMGEXP == ""] <- 1
cleanData$CROPDE[cleanData$CROPDMGEXP %in%  c("h","H")] <- 100
cleanData$CROPDE[cleanData$CROPDMGEXP %in%  c("K","k")] <- 1000
cleanData$CROPDE[cleanData$CROPDMGEXP %in%  c("M","m")] <- 1000000
cleanData$CROPDE[cleanData$CROPDMGEXP %in%  c("B","b")] <- 1000000000
cleanData$PROPDMG <- cleanData$PROPDMG * cleanData$PROPDE
cleanData$CROPDMG <- cleanData$CROPDMG * cleanData$CROPDE
cleanData$HEALTHDMG <- cleanData$FATALITIES + cleanData$INJURIES
cleanData$ECONOMICC <- cleanData$PROPDMG + cleanData$CROPDMG
cleanData <- select(EVTYPE,FATALITIES,INJURIES,HEALTHDMG,
                    ECONOMICC,.data=cleanData)

This code is for data preview:

kable(head(cleanData),align = 'c')

This code is for event type clustering into common types:

cleanData$EVTYPE <- toupper(cleanData$EVTYPE)
cleanData$EVTYPE[grep('.*FLOOD.*',cleanData$EVTYPE)] <- "FLOOD"
cleanData$EVTYPE[grep('.*FIRE.*',cleanData$EVTYPE)] <- "FIRE"
cleanData$EVTYPE[grep('.*DRY.*',cleanData$EVTYPE)] <- "HEAT"
cleanData$EVTYPE[grep('.*HIGH.*TEMPER.*',cleanData$EVTYPE)] <- "HEAT"
cleanData$EVTYPE[grep('.*WARM.*',cleanData$EVTYPE)] <- "HEAT"
cleanData$EVTYPE[grep('.*HEAT.*',cleanData$EVTYPE)] <- "HEAT"
cleanData$EVTYPE[grep('.*HAIL.*',cleanData$EVTYPE)] <- "HAIL"
cleanData$EVTYPE[grep('.*HURRICANE.*',cleanData$EVTYPE)] <- "HURRICANE"
cleanData$EVTYPE[grep('.*TORN.*',cleanData$EVTYPE)] <- "TORNADO"
cleanData$EVTYPE[grep('.*PRECIPITATION.*',cleanData$EVTYPE)] <- "RAIN"
cleanData$EVTYPE[grep('.*SHOWER.*',cleanData$EVTYPE)] <- "RAIN"
cleanData$EVTYPE[grep('.*RAIN.*',cleanData$EVTYPE)] <- "RAIN"
cleanData$EVTYPE[grep('.*LIGHT.*',cleanData$EVTYPE)] <- "STORM"
cleanData$EVTYPE[grep('.*STORM.*',cleanData$EVTYPE)] <- "STORM"
cleanData$EVTYPE[grep('.*WINTER.*',cleanData$EVTYPE)] <- "SNOW"
cleanData$EVTYPE[grep('.*SNOW.*',cleanData$EVTYPE)] <- "SNOW"
cleanData$EVTYPE[grep('.*COLD.*',cleanData$EVTYPE)] <- "COLD"
cleanData$EVTYPE[grep('.*LOW.*TEMPER.*',cleanData$EVTYPE)] <- "COLD"
cleanData$EVTYPE[grep('.*FROST.*',cleanData$EVTYPE)] <- "COLD"
cleanData$EVTYPE[grep('.*BLIZZARD.*',cleanData$EVTYPE)] <- "COLD"
cleanData$EVTYPE[grep('.*WIND.*',cleanData$EVTYPE)] <- "WIND"
cleanData$EVTYPE[grep('.*SUMMA.*',cleanData$EVTYPE)] <- "OTHER"
cleanData$EVTYPE[which(!(cleanData$EVTYPE %in% c("FLOOD", "FIRE", "HEAT", "HAIL", 
                                "HURRICANE", "TORNADO", "RAIN", 
                                "STORM", "SNOW", "COLD", "WIND", 
                                "OTHER")))] <- "OTHER"

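The sequential replacements above behave like an ordered list of rules in which the first matching pattern decides the group. A minimal sketch of that idea with a rule table follows; the function name classify_event and the shortened rule list are hypothetical, and the sample event names are made up for illustration.

```r
# Hypothetical helper: assign each event type to the group of the first
# rule whose pattern matches; anything unmatched becomes "OTHER".
classify_event <- function(evtype, rules) {
  evtype   <- toupper(evtype)
  group    <- rep("OTHER", length(evtype))
  assigned <- rep(FALSE, length(evtype))
  for (i in seq_len(nrow(rules))) {
    hit <- !assigned & grepl(rules$pattern[i], evtype)
    group[hit] <- rules$group[i]
    assigned   <- assigned | hit
  }
  group
}

# Shortened rule list, in priority order (illustration only).
rules <- data.frame(
  pattern = c("FLOOD", "HEAT", "HAIL", "TORN", "RAIN", "STORM", "WIND"),
  group   = c("FLOOD", "HEAT", "HAIL", "TORNADO", "RAIN", "STORM", "WIND"),
  stringsAsFactors = FALSE
)

events <- c("Flash Flood", "HEAVY RAIN", "TSTM WIND",
            "EXCESSIVE HEAT", "VOLCANIC ASH")
print(classify_event(events, rules))
# "FLOOD" "RAIN" "WIND" "HEAT" "OTHER"

# Note on quantifiers: a trailing * applies only to the preceding character,
# so the pattern HEAT* means "HEA" followed by zero or more "T".
print(grepl("HEAT*", "HEAVY RAIN"))   # TRUE
print(grepl("HEAT", "HEAVY RAIN"))    # FALSE
```

Keeping the rules in a table makes the priority order explicit and easier to review than a long series of assignments.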
Data ordering for plotting (seen previously):

propertyDamageData <- cleanData %>%
                      group_by(EVTYPE) %>%
                      summarise(PROPDMG = sum(ECONOMICC)) %>%
                      arrange(PROPDMG)
propertyDamageData <- transform(propertyDamageData, 
                                EVTYPE = reorder(EVTYPE, PROPDMG))

healthDamageData   <- cleanData %>%
                      group_by(EVTYPE) %>%
                      summarise(HEALTHDMG = sum(HEALTHDMG)) %>%
                      arrange(HEALTHDMG)
healthDamageData <- transform(healthDamageData, 
                                EVTYPE = reorder(EVTYPE, HEALTHDMG))

Plot for property damage barchart:

p <- ggplot(data=propertyDamageData, aes(x=EVTYPE, y=PROPDMG, fill=EVTYPE)) +
    geom_bar(stat="identity", show.legend = FALSE) + 
    ylab("Economic Cost $") + 
    xlab("Event Type") + 
    coord_flip() + 
    ggtitle("Cost For Property And Crop Damages")
p

Plot for public health damage barchart:

p <- ggplot(data=healthDamageData, aes(x=EVTYPE, y=HEALTHDMG, fill=EVTYPE)) +
    geom_bar(stat="identity", show.legend = FALSE) + 
    ylab("Public health damage cases") + 
    xlab("Event Type") + 
    coord_flip() + 
    ggtitle("Events Affecting Public Health")

p