Exploring NOAA storm data base, to draw basic results

-author: “Georgios Mintzopoulos” -date: “Tuesday, September 16, 2014”

Synopsis

Storms and other severe weather events cause both public health and economic problems for communities and municipalities. Many severe events can result in fatalities, injuries, and property damage.

This project involves exploring the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. This database tracks characteristics of major storms and weather events in the United States, including when and where hey occur, as well as estimates of any fatalities, injuries, and property damage.

We analyze the raw data to produce a tidy data set and perform two steps of exploratory analysis regarding impact of events on health and the costs associated with damages.

Data Processing

The data for this Coursera course assignment, come in the form of a comma-separated-value file named “repdata-data-StormData.csv”. In this analysis we start from the raw CSV file containing the data, which is supposed to be available.

We read the data into R environment.

raw_data=read.csv("repdata-data-StormData.csv",header=T,sep=",")

# the raw data file size is:
format(object.size(raw_data),units="Mb")
## [1] "397.7 Mb"

The raw data have the following structure,

str(raw_data)
## 'data.frame':    902297 obs. of  37 variables:
##  $ STATE__   : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ BGN_DATE  : Factor w/ 16335 levels "1/1/1966 0:00:00",..: 6523 6523 4242 11116 2224 2224 2260 383 3980 3980 ...
##  $ BGN_TIME  : Factor w/ 3608 levels "00:00:00 AM",..: 272 287 2705 1683 2584 3186 242 1683 3186 3186 ...
##  $ TIME_ZONE : Factor w/ 22 levels "ADT","AKS","AST",..: 7 7 7 7 7 7 7 7 7 7 ...
##  $ COUNTY    : num  97 3 57 89 43 77 9 123 125 57 ...
##  $ COUNTYNAME: Factor w/ 29601 levels "","5NM E OF MACKINAC BRIDGE TO PRESQUE ISLE LT MI",..: 13513 1873 4598 10592 4372 10094 1973 23873 24418 4598 ...
##  $ STATE     : Factor w/ 72 levels "AK","AL","AM",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ EVTYPE    : Factor w/ 985 levels "   HIGH SURF ADVISORY",..: 834 834 834 834 834 834 834 834 834 834 ...
##  $ BGN_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ BGN_AZI   : Factor w/ 35 levels "","  N"," NW",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ BGN_LOCATI: Factor w/ 54429 levels "","- 1 N Albion",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ END_DATE  : Factor w/ 6663 levels "","1/1/1993 0:00:00",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ END_TIME  : Factor w/ 3647 levels ""," 0900CST",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ COUNTY_END: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ COUNTYENDN: logi  NA NA NA NA NA NA ...
##  $ END_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ END_AZI   : Factor w/ 24 levels "","E","ENE","ESE",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ END_LOCATI: Factor w/ 34506 levels "","- .5 NNW",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ LENGTH    : num  14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
##  $ WIDTH     : num  100 150 123 100 150 177 33 33 100 100 ...
##  $ F         : int  3 2 2 2 2 2 2 1 3 3 ...
##  $ MAG       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ FATALITIES: num  0 0 0 0 0 0 0 0 1 0 ...
##  $ INJURIES  : num  15 0 2 2 2 6 1 0 14 0 ...
##  $ PROPDMG   : num  25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
##  $ PROPDMGEXP: Factor w/ 19 levels "","-","?","+",..: 17 17 17 17 17 17 17 17 17 17 ...
##  $ CROPDMG   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CROPDMGEXP: Factor w/ 9 levels "","?","0","2",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ WFO       : Factor w/ 542 levels ""," CI","$AC",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ STATEOFFIC: Factor w/ 250 levels "","ALABAMA, Central",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ ZONENAMES : Factor w/ 25112 levels "","                                                                                                                               "| __truncated__,..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ LATITUDE  : num  3040 3042 3340 3458 3412 ...
##  $ LONGITUDE : num  8812 8755 8742 8626 8642 ...
##  $ LATITUDE_E: num  3051 0 0 0 0 ...
##  $ LONGITUDE_: num  8806 0 0 0 0 ...
##  $ REMARKS   : Factor w/ 436774 levels "","-2 at Deer Park\n",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ REFNUM    : num  1 2 3 4 5 6 7 8 9 10 ...

We set the printing of numbers in R as below.

options(scipen = 12)   # set numbers printing      

Some more processing of the data follows, next. We show the annotated code and some print outs.

library(stringr)
# get the variables names
var.raw_data=names(raw_data)
# number the variables (columns) as a ref for subsetting
names(var.raw_data)=seq(1:length(var.raw_data))
# print the columns
var.raw_data
##            1            2            3            4            5 
##    "STATE__"   "BGN_DATE"   "BGN_TIME"  "TIME_ZONE"     "COUNTY" 
##            6            7            8            9           10 
## "COUNTYNAME"      "STATE"     "EVTYPE"  "BGN_RANGE"    "BGN_AZI" 
##           11           12           13           14           15 
## "BGN_LOCATI"   "END_DATE"   "END_TIME" "COUNTY_END" "COUNTYENDN" 
##           16           17           18           19           20 
##  "END_RANGE"    "END_AZI" "END_LOCATI"     "LENGTH"      "WIDTH" 
##           21           22           23           24           25 
##          "F"        "MAG" "FATALITIES"   "INJURIES"    "PROPDMG" 
##           26           27           28           29           30 
## "PROPDMGEXP"    "CROPDMG" "CROPDMGEXP"        "WFO" "STATEOFFIC" 
##           31           32           33           34           35 
##  "ZONENAMES"   "LATITUDE"  "LONGITUDE" "LATITUDE_E" "LONGITUDE_" 
##           36           37 
##    "REMARKS"     "REFNUM"
# subset the data to keep the columns of interest
data=subset(raw_data,select=c(8,23:28))
rm(raw_data)

# new data set has size:
format(object.size(data),units="Mb")
## [1] "37.9 Mb"
# new data structure
str(data)
## 'data.frame':    902297 obs. of  7 variables:
##  $ EVTYPE    : Factor w/ 985 levels "   HIGH SURF ADVISORY",..: 834 834 834 834 834 834 834 834 834 834 ...
##  $ FATALITIES: num  0 0 0 0 0 0 0 0 1 0 ...
##  $ INJURIES  : num  15 0 2 2 2 6 1 0 14 0 ...
##  $ PROPDMG   : num  25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
##  $ PROPDMGEXP: Factor w/ 19 levels "","-","?","+",..: 17 17 17 17 17 17 17 17 17 17 ...
##  $ CROPDMG   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CROPDMGEXP: Factor w/ 9 levels "","?","0","2",..: 1 1 1 1 1 1 1 1 1 1 ...
# get the variables names
var.data=names(data)
# number the variables (columns) as a ref for subsetting
names(var.data)=seq(1:length(var.data))
var.data
##            1            2            3            4            5 
##     "EVTYPE" "FATALITIES"   "INJURIES"    "PROPDMG" "PROPDMGEXP" 
##            6            7 
##    "CROPDMG" "CROPDMGEXP"

Our data have show many different storm events in the variable EVTYPE. We will use the defined from NOAA in our codebook 48 events We manually define a vector containing these events and we continue our data processing based on these defined events.

# We set the storm data acceptable event types as defined in our codebook
# this will be a vector containing events as strings
# ...we set all values to upper case, and trim whitespaces

# Storm Data Events taken from description
StormDataEventTable=(c("Astronomical Low Tide","Avalanche", "Blizzard", "Coastal Flood","Cold/Wind Chill", "Debris Flow", "Dense Fog", "Dense Smoke ","Drought ","Dust Devil ","Dust Storm ","Excessive Heat ","Extreme Cold/Wind Chill ","Flash Flood ","Flood ","Frost/Freeze ","Funnel Cloud ","Freezing Fog ","Hail ","Heat ","Heavy Rain ","Heavy Snow ","High Surf ","High Wind ","Hurricane (Typhoon) ","Ice Storm ","Lake-Effect Snow ","Lakeshore Flood ","Lightning ","Marine Hail ","Marine High Wind ","Marine Strong Wind ","Marine Thunderstorm Wind ","Rip Current ","Seiche ","Sleet ","Storm Surge/Tide ", "Strong Wind ","Thunderstorm Wind ","Tornado ","Tropical Depression ","Tropical Storm ","Tsunami ","Volcanic Ash ","Waterspout ","Wildfire ","Winter Storm ","Winter Weather" ))

# let's use this as a vector with the events types
events=as.vector(str_trim(toupper(StormDataEventTable),side="both"))
rm(StormDataEventTable)

# remove all white spaces from the values of data$EVTYPE
data$EVTYPE=as.character(str_trim(data$EVTYPE,side="both"))

# check out which observations in our data fall in the category of valid events and subset the data accordingly, producing a new data set called "valid_data"
valid_data=subset(data,data$EVTYPE%in%events)
rm(data)

We have now produced a subset of the original data set called valid_data containing the events described previously.

# the variables and types of the valid_data data set are:
sapply(valid_data,class)
##      EVTYPE  FATALITIES    INJURIES     PROPDMG  PROPDMGEXP     CROPDMG 
## "character"   "numeric"   "numeric"   "numeric"    "factor"   "numeric" 
##  CROPDMGEXP 
##    "factor"

We next examine the cost parameters in our data. There are two variables, called “PROPDMG” and “CROPDMG”, for property damages and crop damages respectively. For each of these variables there is a also a column containing a multiplier for its values. The multipliers found in our data are:

# set cash multipliers from property and crop damages
valid_data$PROPDMGEXP=as.character(valid_data$PROPDMGEXP)
valid_data$CROPDMGEXP=as.character(valid_data$CROPDMGEXP)

#property multipliers
PROPDMGEXP=unique(toupper(valid_data$PROPDMGEXP))
PROPDMGEXP
##  [1] "K" "M" ""  "B" "+" "0" "5" "2" "4" "7" "?" "-" "6" "3" "1" "8" "H"
# crop multipliers
CROPDMGEXP=unique(toupper(valid_data$CROPDMGEXP))
CROPDMGEXP
## [1] ""  "K" "M" "B" "0"
# CROPDMGEXP values are a subset of PROPDMGEXP values
CROPDMGEXP%in%PROPDMGEXP
## [1] TRUE TRUE TRUE TRUE TRUE

The multipliers are interpreted as following.. - H: hundred $ (10^2) - K: thousand $ (10^3) - M: million $ (10^6) - B: billion $ (10^9) - any number in {2,3,4,5,6,7,8} is a multiple; eg (DMG x number) - 0,1 and any other symbol is treated as dollars (no multiples)

Next we process the multipliers in R

# set the multipliers values
m0=data.frame(c("","+","?","-","1","0"),1) 
    names(m0)=c("symbol","multiplier")

m1=data.frame(c("2","3","4","5","6","7","8"),c(2,3,4,5,6,7,8))
    names(m1)=c("symbol","multiplier")

m2=data.frame(c("H","K","M","B"),c(10^2,10^3,10^6,10^9))
    names(m2)=c("symbol","multiplier")

M=rbind(m0,m1,m2)
M$symbol=as.character(M$symbol)
# The multipliers are 
M
##    symbol multiplier
## 1                  1
## 2       +          1
## 3       ?          1
## 4       -          1
## 5       1          1
## 6       0          1
## 7       2          2
## 8       3          3
## 9       4          4
## 10      5          5
## 11      6          6
## 12      7          7
## 13      8          8
## 14      H        100
## 15      K       1000
## 16      M    1000000
## 17      B 1000000000
rm(m0,m1,m2)

We now calculate our cost of damages as following:

# write a function to calculate the costs
    set_costs=function(vectorIn,multipliers)
            {
                for (i in 1:length(vectorIn))
                {
                    index=match(vectorIn[i],multipliers[,1])
                    vectorIn[i]=multipliers[index,2]
                }
            
                return (vectorIn)
            }

# calculate the costs for PROPDMG and CROPDMG
    valid_data$PROPDMGEXP=set_costs(valid_data$PROPDMGEXP,M)
    valid_data$PROPDMGEXP=as.numeric(valid_data$PROPDMGEXP)

    valid_data$CROPDMGEXP=set_costs(valid_data$CROPDMGEXP,M)
    valid_data$CROPDMGEXP=as.numeric(valid_data$CROPDMGEXP)

# calculate the two costs
valid_data$PROPDMG=valid_data$PROPDMG*valid_data$PROPDMGEXP
valid_data$CROPDMG=valid_data$CROPDMG*valid_data$CROPDMGEXP

Results

Next we will process our data doing some final steps in order to produce a couple of plots and draw some conclusions

#_________________________ draw some conclusions_________________________
        library(dplyr)
## 
## Attaching package: 'dplyr'
## 
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## 
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
        library(reshape2)
        library(ggplot2)
        library(scales)

# subset to keep only the variables of interest
valid_data=select(valid_data,1:4,6)

# the data size is now
format(object.size(valid_data),units="Mb")
## [1] "24.2 Mb"
# and has the variables
sapply(valid_data,class)
##      EVTYPE  FATALITIES    INJURIES     PROPDMG     CROPDMG 
## "character"   "numeric"   "numeric"   "numeric"   "numeric"
# fatalities and injuries are all complete cases
sum(is.na(valid_data$FATALITIES))
## [1] 0
sum(is.na(valid_data$INJURIES))
## [1] 0

The produced data that we will use in our plots are, “health” containing the health impacts (fatalities & injuries), and “cost” containing the damage impact (property & crop), per event type.

# aggreagated data are as following:
health=group_by(valid_data,EVTYPE)%.%summarize(Total_fatalities=sum(FATALITIES),Total_injuries=sum(INJURIES))
health=arrange(health,-Total_injuries)
health
## Source: local data frame [46 x 3]
## 
##                      EVTYPE Total_fatalities Total_injuries
## 1                   TORNADO             5633          91346
## 2                     FLOOD              470           6789
## 3            EXCESSIVE HEAT             1903           6525
## 4                 LIGHTNING              816           5230
## 5                      HEAT              937           2100
## 6                 ICE STORM               89           1975
## 7               FLASH FLOOD              978           1777
## 8         THUNDERSTORM WIND              133           1488
## 9                      HAIL               15           1361
## 10             WINTER STORM              206           1321
## 11                HIGH WIND              248           1137
## 12               HEAVY SNOW              127           1021
## 13                 WILDFIRE               75            911
## 14                 BLIZZARD              101            805
## 15               DUST STORM               22            440
## 16           WINTER WEATHER               33            398
## 17                DENSE FOG               18            342
## 18           TROPICAL STORM               58            340
## 19              STRONG WIND              103            280
## 20               HEAVY RAIN               98            251
## 21              RIP CURRENT              368            232
## 22                AVALANCHE              224            170
## 23                HIGH SURF              101            152
## 24                  TSUNAMI               33            129
## 25               DUST DEVIL                2             42
## 26               WATERSPOUT                3             29
## 27 MARINE THUNDERSTORM WIND               10             26
## 28  EXTREME COLD/WIND CHILL              125             24
## 29       MARINE STRONG WIND               14             22
## 30          COLD/WIND CHILL               95             12
## 31         STORM SURGE/TIDE               11              5
## 32                  DROUGHT                0              4
## 33             FUNNEL CLOUD                0              3
## 34            COASTAL FLOOD                3              2
## 35         MARINE HIGH WIND                1              1
## 36    ASTRONOMICAL LOW TIDE                0              0
## 37              DENSE SMOKE                0              0
## 38             FREEZING FOG                0              0
## 39             FROST/FREEZE                0              0
## 40         LAKE-EFFECT SNOW                0              0
## 41          LAKESHORE FLOOD                0              0
## 42              MARINE HAIL                0              0
## 43                   SEICHE                0              0
## 44                    SLEET                2              0
## 45      TROPICAL DEPRESSION                0              0
## 46             VOLCANIC ASH                0              0
cost=group_by(valid_data,EVTYPE)%.%summarize(Property_damage=sum(PROPDMG,na.rm=T),Crop_damage=sum(CROPDMG,na.rm=T))
cost=arrange(cost,-Property_damage)
cost
## Source: local data frame [46 x 3]
## 
##                      EVTYPE Property_damage Crop_damage
## 1                     FLOOD    144657709807  5661968450
## 2                   TORNADO     56925661188   414953270
## 3               FLASH FLOOD     16140862555  1421317100
## 4                      HAIL     15727367663  3025537473
## 5            TROPICAL STORM      7703890550   678346000
## 6              WINTER STORM      6688497251    26944000
## 7                 HIGH WIND      5270046295   638571300
## 8                  WILDFIRE      4765114000   295472800
## 9          STORM SURGE/TIDE      4641188000      850000
## 10                ICE STORM      3944927860  5022113500
## 11        THUNDERSTORM WIND      3483121296   414843050
## 12                  DROUGHT      1046106000 13972566000
## 13               HEAVY SNOW       932589149   134653100
## 14                LIGHTNING       928659516    12092090
## 15               HEAVY RAIN       694248090   733399800
## 16                 BLIZZARD       659213950   112060000
## 17            COASTAL FLOOD       237665560           0
## 18              STRONG WIND       175241450    64953500
## 19                  TSUNAMI       144062000       20000
## 20                HIGH SURF        89575000           0
## 21         LAKE-EFFECT SNOW        40115000           0
## 22           WINTER WEATHER        20866000    15000000
## 23                DENSE FOG         9674000           0
## 24             FROST/FREEZE         9480000  1094086000
## 25               WATERSPOUT         9353700           0
## 26  EXTREME COLD/WIND CHILL         8648000       50000
## 27           EXCESSIVE HEAT         7753700   492402000
## 28          LAKESHORE FLOOD         7540000           0
## 29               DUST STORM         5549000     3100000
## 30                AVALANCHE         3721800           0
## 31             FREEZING FOG         2182000           0
## 32          COLD/WIND CHILL         1990000      600000
## 33                     HEAT         1797000   401461500
## 34      TROPICAL DEPRESSION         1737000           0
## 35         MARINE HIGH WIND         1297010           0
## 36                   SEICHE          980000           0
## 37               DUST DEVIL          700330           0
## 38             VOLCANIC ASH          500000           0
## 39 MARINE THUNDERSTORM WIND          436400       50000
## 40       MARINE STRONG WIND          418330           0
## 41    ASTRONOMICAL LOW TIDE          320000           0
## 42             FUNNEL CLOUD          194600           0
## 43              DENSE SMOKE          100000           0
## 44              MARINE HAIL            4000           0
## 45              RIP CURRENT            1000           0
## 46                    SLEET               0           0

Next we draw our plots

# draw plotA: total fatalities and injuries per event type

# transform EVTYPE back to factor ordered by decreasing total number of Injuries/Property damage
health$EVTYPE=factor(health$EVTYPE,levels=health$EVTYPE[order(-health$Total_injuries)],ordered=T)
cost$EVTYPE=factor(cost$EVTYPE,levels=cost$EVTYPE[order(-cost$Property_damage)],ordered=T)

# use melt function to transform the data frame to long form for correct plotting
health_long=melt(health,id="EVTYPE",measure.vars=c("Total_fatalities","Total_injuries"))

plotA=ggplot(health_long,aes(EVTYPE,value,group=variable,fill=variable))+geom_bar(stat="identity")+xlab("Event type")+ylab("")+scale_y_continuous(labels = comma)+theme(axis.title.x = element_text(face="bold", colour="black", size=10),axis.text.x  = element_text(angle=90, vjust=0.5, size=8))

print(plotA)

plot of chunk unnamed-chunk-12

# draw plot2: total cost per event type

# transform data frame to long form
cost_long=melt(cost,id="EVTYPE",measure.vars=c("Property_damage","Crop_damage"))

plotB=ggplot(cost_long,aes(EVTYPE,value,group=variable,fill=variable))+geom_bar(stat="identity")+xlab("Event type")+ylab("USD")+scale_y_continuous(labels=comma)+theme(axis.title.x = element_text(face="bold", colour="black", size=10),axis.text.x  = element_text(angle=90, vjust=0.5, size=8))

print(plotB)

plot of chunk unnamed-chunk-12

From the two plots we can identify the events that have the highest impact on health and the highest damage costs.