Title: Analysis of the NOAA Storm Database and Finding the Severest Weather Events

Synopsis

The analysis answers the following questions:

  1. Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?
  2. Across the United States, which types of events have the greatest economic consequences?

The present analysis consists of the following steps:

Data Processing

Loading and reading the data into R:

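# Download the compressed CSV; mode = "wb" keeps the binary file intact on Windows
url_df <- "http://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
download.file(url_df, "repdata-data-StormData.csv.bz2", mode = "wb")
# read.csv() decompresses the file through the bzfile() connection
df <- read.csv(bzfile("repdata-data-StormData.csv.bz2"))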

Tidying up the EVTYPE variable:

number_uniqueEVTYPE <- length(unique(df$EVTYPE))
number_Summary <- length(grep("Summary", df$EVTYPE, ignore.case = TRUE))
number_TSTM <- length(grep("TSTM", df$EVTYPE, ignore.case = TRUE))

The EVTYPE variable indicates the type of weather event. It contains 985 unique values, which need to be mapped to the 48 event types defined by NOAA. The 48 NOAA Event Types are listed on page 6 of the National Weather Service Storm Data Documentation.

The 48 NOAA Event Types, plus the extra “Summary” type:

StormData <- c("Astronomical Low Tide", "Avalanche", "Blizzard", "Coastal Flood", "Cold/Wind Chill", 
               "Debris Flow", "Dense Fog", "Dense Smoke","Drought", "Dust Devil", "Dust Storm", 
               "Excessive Heat", "Extreme Cold/Wind Chill", "Flash Flood", "Flood", "Frost/Freeze", 
               "Funnel Cloud", "Freezing Fog", "Hail", "Heat", "Heavy Rain", "Heavy Snow", "High Surf", 
               "High Wind", "Hurricane (Typhoon)", "Ice Storm", "Lake-Effect Snow", "Lakeshore Flood", 
               "Lightning", "Marine Hail", "Marine High Wind", "Marine Strong Wind", 
               "Marine Thunderstorm Wind", "Rip Current", "Seiche", "Sleet", "Storm Surge/Tide", 
               "Strong Wind", "Thunderstorm Wind", "Tornado", "Tropical Depression", 
               "Tropical Storm", "Tsunami", "Volcanic Ash", "Waterspout", "Wildfire", "Winter Storm", 
               "Winter Weather", "Summary")

In addition, 76 records contain summary data rather than a single event, so a separate type, “Summary”, was added to the list for them.
Finally, 227236 records abbreviate “THUNDERSTORM” as “TSTM”. These abbreviations were expanded:

df$EVTYPE <- gsub("TSTM", "THUNDERSTORM", df$EVTYPE, ignore.case = TRUE)
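
The counting and replacement steps above can be tried on a few labels first. The sketch below uses a made-up vector `ev` (not taken from the data set) and the same `grep`-style patterns; note that the replacement text is inserted in upper case whatever the case of the matched text.

```r
# Toy event labels, invented for illustration only
ev <- c("TSTM WIND", "Marine Tstm Wind", "HAIL", "Summary of May 22")

# Count the labels that contain each pattern, ignoring case
n_tstm    <- sum(grepl("TSTM", ev, ignore.case = TRUE))
n_summary <- sum(grepl("Summary", ev, ignore.case = TRUE))

# Expand the abbreviation wherever it occurs
ev_clean <- gsub("TSTM", "THUNDERSTORM", ev, ignore.case = TRUE)
print(ev_clean)
```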

To convert the EVTYPE values to the 48 NOAA Event Types (+ Summary) we used approximate string matching based on string distances, as described in “An introduction to data cleaning with R” by Edwin de Jonge and Mark van der Loo:

A string distance is an algorithm or equation that indicates how much two strings differ from each other. An important distance measure is implemented by R’s native adist function. This function counts how many basic operations (insertions, deletions, and substitutions of single characters) are needed to turn one string into another.
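
As a small illustration of what adist counts, the sketch below compares a few strings chosen for this example (they are not part of the analysis). Each call returns a one-by-one distance matrix.

```r
# "kitten" -> "sitting": two substitutions and one insertion
d_kitten <- adist("kitten", "sitting")

# With ignore.case = TRUE, strings that differ only in case are at distance 0
d_case <- adist("TSTM WIND", "tstm wind", ignore.case = TRUE)

# "HAIL" -> "HAIL 75": three inserted characters (space, 7, 5)
d_hail <- adist("HAIL", "HAIL 75")

print(c(d_kitten, d_case, d_hail))
```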

D <- adist(df$EVTYPE, StormData, ignore.case = TRUE)
i <- apply(D, 1, which.min)
df$TYPE <- StormData[i]

The function adist returns the distance matrix D, with one row per EVTYPE value and one column per entry of StormData. Then we find the index of the smallest distance in each row of D (if several types tie, which.min keeps the first) and create a new column TYPE with the corresponding NOAA Event Type.
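
The matching step can be checked on a toy example. The sketch below uses invented labels in `raw` and a three-entry reference list `ref`. It also computes the distances on the unique labels only and maps the result back with `match`, which is an optional variation for speed and not the code used above.

```r
# Invented raw labels and a shortened reference list (illustration only)
raw <- c("THUNDERSTORM WINDS", "FLASH FLOODING", "HAIL 75", "FLASH FLOODING")
ref <- c("Thunderstorm Wind", "Flash Flood", "Hail")

# Distances are needed only once per distinct label
u <- unique(raw)
D <- adist(u, ref, ignore.case = TRUE)   # rows: labels, columns: reference types

# Closest reference type for each distinct label
best <- ref[apply(D, 1, which.min)]

# Map the result back to every original record
matched <- best[match(raw, u)]
print(data.frame(raw, matched))
```

On the full data set this variation would compute 985 rows of distances instead of one row per record, while assigning the same type to each record.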

Plotting the Pareto charts:

The analysis aims to find the severest weather events, and Pareto analysis is a suitable technique for this. It helps to identify the few causes that account for most of a problem: it estimates the contribution of each cause, then selects the small number of causes whose combined contribution is reasonably close to the total.
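
The arithmetic behind a Pareto chart is short enough to sketch in base R. The totals in `x` below are invented; the code sorts them, computes cumulative percentages, and finds how many causes are needed to reach 80% of the total.

```r
# Invented totals per cause (illustration only)
x <- c(C = 15, A = 50, D = 5, B = 30)

# Sort from largest to smallest contribution
x_sorted <- sort(x, decreasing = TRUE)

# Cumulative share of the total, in percent
cum_pct <- 100 * cumsum(x_sorted) / sum(x_sorted)

# Number of causes needed to reach at least 80% of the total
n_needed <- which(cum_pct >= 80)[1]
print(cum_pct)
print(names(x_sorted)[seq_len(n_needed)])
```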

To find the types of events that are most harmful with respect to population health, we first restricted the data to events that caused personal harm (fatalities or injuries) and summed the harm for each event type:

library("dplyr")
df_harm <- df %>% filter((FATALITIES + INJURIES) > 0) %>% 
    group_by(TYPE) %>% summarise(HARM = sum(FATALITIES + INJURIES))
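
For readers without dplyr, the same filter-and-sum step can be sketched in base R. The data frame `toy` below is invented and only mimics the three columns used above; `aggregate` is a stand-in for the dplyr pipeline, not the code that produced the results in this report.

```r
# Invented records with the three columns used above
toy <- data.frame(
  TYPE       = c("Tornado", "Flood", "Tornado", "Hail", "Flood"),
  FATALITIES = c(2, 0, 1, 0, 1),
  INJURIES   = c(10, 0, 4, 0, 3)
)

# Personal harm per record
toy$HARM <- toy$FATALITIES + toy$INJURIES

# Keep records with any harm, then sum the harm within each type
harm_by_type <- aggregate(HARM ~ TYPE, data = toy[toy$HARM > 0, ], FUN = sum)
print(harm_by_type)
```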

Then we plotted the Pareto chart using the package qcc.

library("qcc")
harm <- df_harm$HARM
names(harm) <- df_harm$TYPE
pareto.chart(harm, main = "Pareto chart for the most harmful events", ylab = "Total personal harm")
##                           
## Pareto chart analysis for harm
##                            Frequency Cum.Freq.   Percentage Cum.Percent.
##   Tornado                      96997     96997 62.308171616     62.30817
##   Thunderstorm Wind            10236    107233  6.575321347     68.88349
##   Excessive Heat                8723    115956  5.603412281     74.48691
##   Flood                         8528    124484  5.478149711     79.96505
##   Lightning                     6051    130535  3.886993891     83.85205
##   Heat                          3649    134184  2.344015982     86.19606
##   Flash Flood                   2846    137030  1.828191144     88.02426
##   Ice Storm                     2118    139148  1.360544218     89.38480
##   High Wind                     1824    140972  1.171686805     90.55649
##   Wildfire                      1824    142796  1.171686805     91.72817
##   Winter Storm                  1633    144429  1.048993724     92.77717
##   Hail                          1377    145806  0.884546453     93.66171
##   Hurricane (Typhoon)           1353    147159  0.869129521     94.53084
##   Heavy Snow                    1251    148410  0.803607562     95.33445
##   Rip Current                   1106    149516  0.710463600     96.04491
##   Blizzard                       906    150422  0.581989170     96.62690
##   Winter Weather                 606    151028  0.389277524     97.01618
##   Dust Storm                     462    151490  0.296775934     97.31296
##   Tropical Storm                 449    151939  0.288425096     97.60138
##   Strong Wind                    442    152381  0.283928491     97.88531
##   Avalanche                      420    152801  0.269796304     98.15511
##   High Surf                      408    153209  0.262087838     98.41720
##   Funnel Cloud                   396    153605  0.254379372     98.67157
##   Heavy Rain                     386    153991  0.247955651     98.91953
##   Dense Fog                      362    154353  0.232538719     99.15207
##   Seiche                         256    154609  0.164447271     99.31652
##   Extreme Cold/Wind Chill        171    154780  0.109845638     99.42636
##   Tsunami                        162    154942  0.104064289     99.53043
##   Coastal Flood                  130    155072  0.083508380     99.61393
##   Cold/Wind Chill                121    155193  0.077727030     99.69166
##   Waterspout                      93    155286  0.059740610     99.75140
##   Marine Thunderstorm Wind        79    155365  0.050747400     99.80215
##   Drought                         68    155433  0.043681306     99.84583
##   Storm Surge/Tide                67    155500  0.043038934     99.88887
##   Freezing Fog                    51    155551  0.032760980     99.92163
##   Dust Devil                      45    155596  0.028906747     99.95054
##   Marine Strong Wind              37    155633  0.023767770     99.97431
##   Marine Hail                     26    155659  0.016701676     99.99101
##   Marine High Wind                 6    155665  0.003854233     99.99486
##   Sleet                            6    155671  0.003854233     99.99872
##   Tropical Depression              2    155673  0.001284744    100.00000
abline(h=(sum(harm)*.8),col="red",lwd=1)

The function pareto.chart returns a table containing the descriptive statistics used to draw the Pareto chart. We also added a horizontal line at 80% of the total harm to identify the severest events.

To find the types of events that have the greatest economic consequences, we first restricted the data to events that caused dollar damage and summed the damage for each event type:

df_damage <- df %>% filter((PROPDMG + CROPDMG) > 0) %>% group_by(TYPE) %>% 
    summarise(DMG = sum(ifelse(PROPDMGEXP=="K",PROPDMG*10^3,
                               ifelse(PROPDMGEXP=="M",PROPDMG*10^6,
                                      ifelse(PROPDMGEXP=="B",PROPDMG*10^9,PROPDMG))) +
                            ifelse(CROPDMGEXP=="K",CROPDMG*10^3,
                                   ifelse(CROPDMGEXP=="M",CROPDMG*10^6,
                                          ifelse(CROPDMGEXP=="B",CROPDMG*10^9,CROPDMG)))))

The variables PROPDMGEXP and CROPDMGEXP hold the units of the PROPDMG and CROPDMG variables. Their values K, M, and B mean thousands, millions, and billions; with any other value the code above uses the amount as it stands.
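
The nested ifelse calls can also be written as a lookup table. The helper `dmg_multiplier` below is hypothetical and is not part of the analysis. It assumes, unlike the code above, that lower-case codes such as "m" should count as well, so it would change the totals slightly; drop the `toupper` call to reproduce the original behaviour.

```r
# Hypothetical helper: turn exponent codes into numeric multipliers
dmg_multiplier <- function(exp_code) {
  lookup <- c(K = 1e3, M = 1e6, B = 1e9)
  # Assumption: treat lower-case codes like their upper-case forms
  mult <- unname(lookup[toupper(as.character(exp_code))])
  # Unknown or empty codes leave the amount unchanged
  mult[is.na(mult)] <- 1
  mult
}

# Invented amounts and codes (illustration only)
amount  <- c(2.5, 1, 3, 10)
code    <- c("K", "m", "B", "")
dollars <- amount * dmg_multiplier(code)
print(dollars)
```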

Plotting the Pareto chart:

damage <- df_damage$DMG
names(damage) <- df_damage$TYPE
pareto.chart(damage, ylab = "Total damage", main = "Pareto chart for the most damaging events")
##
## Pareto chart analysis for damage
##                               Frequency    Cum.Freq.   Percentage Cum.Percent.
##   Flood                    151264401313 151264401313 3.175332e+01     31.75332
##   Hurricane (Typhoon)       75471243830 226735645143 1.584287e+01     47.59619
##   Tornado                   57345497660 284081142803 1.203793e+01     59.63412
##   Storm Surge/Tide          47965834000 332046976803 1.006895e+01     69.70307
##   Flash Flood               28862068967 360909045770 6.058706e+00     75.76178
##   Hail                      18760322493 379669368263 3.938154e+00     79.69993
##   Drought                   15025751600 394695119863 3.154195e+00     82.85413
##   Seiche                    14625479310 409320599173 3.070171e+00     85.92430
##   Thunderstorm Wind         11050921073 420371520245 2.319802e+00     88.24410
##   Wildfire                   9240150185 429611670430 1.939686e+00     90.18379
##   Ice Storm                  9210099560 438821769990 1.933378e+00     92.11717
##   Tropical Storm             8409291550 447231061540 1.765273e+00     93.88244
##   Winter Storm               6717503751 453948565291 1.410134e+00     95.29257
##   High Wind                  6680148743 460628714034 1.402292e+00     96.69486
##   Marine Thunderstorm Wind   2862652400 463491366434 6.009260e-01     97.29579
##   Winter Weather             2542298000 466033664434 5.336775e-01     97.82947
##   Heavy Rain                 1562007442 467595671876 3.278955e-01     98.15736
##   Frost/Freeze               1456916000 469052587876 3.058348e-01     98.46320
##   Funnel Cloud               1380905000 470433492876 2.898786e-01     98.75308
##   Heavy Snow                 1087550752 471521043628 2.282979e-01     98.98137
##   Lightning                   948304537 472469348165 1.990674e-01     99.18044
##   Blizzard                    771393950 473240742115 1.619305e-01     99.34237
##   Excessive Heat              649211480 473889953595 1.362820e-01     99.47865
##   Coastal Flood               563965560 474453919155 1.183873e-01     99.59704
##   Sleet                       458485000 474912404155 9.624486e-02     99.69329
##   Heat                        421360450 475333764605 8.845170e-02     99.78174
##   Strong Wind                 251577740 475585342345 5.281103e-02     99.83455
##   Tsunami                     144082000 475729424345 3.024559e-02     99.86479
##   Marine High Wind            112817010 475842241355 2.368247e-02     99.88848
##   High Surf                   101555500 475943796855 2.131846e-02     99.90980
##   Marine Hail                  70917000 476014713855 1.488685e-02     99.92468
##   Cold/Wind Chill              68600000 476083313855 1.440046e-02     99.93908
##   Dense Fog                    65774000 476149087855 1.380723e-02     99.95289
##   Waterspout                   60777200 476209865055 1.275831e-02     99.96565
##   Debris Flow                  42050000 476251915055 8.827107e-03     99.97448
##   Lake-Effect Snow             40362000 476292277055 8.472763e-03     99.98295
##   Extreme Cold/Wind Chill      26503000 476318780055 5.563492e-03     99.98851
##   Freezing Fog                 13554500 476332334555 2.845351e-03     99.99136
##   Dust Storm                    9799100 476342133655 2.057020e-03     99.99341
##   Astronomical Low Tide         9745000 476351878655 2.045664e-03     99.99546
##   Avalanche                     8721800 476360600455 1.830874e-03     99.99729
##   Lakeshore Flood               7545000 476368145455 1.583841e-03     99.99887
##   Marine Strong Wind            2118330 476370263785 4.446784e-04     99.99932
##   Tropical Depression           1737000 476372000785 3.646298e-04     99.99968
##   Dust Devil                     743130 476372743915 1.559973e-04     99.99984
##   Volcanic Ash                   500000 476373243915 1.049597e-04     99.99994
##   Rip Current                    163000 476373406915 3.421685e-05     99.99998
##   Dense Smoke                    100050 476373506965 2.100243e-05    100.00000
abline(h=(sum(damage)*.8),col="red",lwd=1)

Results

According to the Pareto principle, roughly 80% of the effects come from 20% of the causes. Looking at the Pareto charts, we draw the following conclusions:

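  1. The events which are most harmful to population health are: Tornado, Thunderstorm Wind, Excessive Heat, and Flood. Together they account for about 80% (79.97%) of all fatalities and injuries, with Tornado alone responsible for 62.31%.
  2. The events which have the greatest economic consequences are: Flood, Hurricane (Typhoon), Tornado, Storm Surge/Tide, Flash Flood, and Hail. Together they account for about 80% (79.70%) of the total damage.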