Synopsis

In this report we aim to analyse the most dangerous weather events through US. To find out the most dangerous event we had to have data about weather events across the US. The data for the analysis came from U.S. National Oceanic and Atmospheric Administration. We found that tornadoe is the most dangerous event in US as for the human health and for property damage. However not all states in US suffer from tornadoes and that is why we found the most dangerous event for every state in US

Loading and processing data for analysis

Data for this analysis came from U.S. National Oceanic and Atmospheric Administration (NOAA). Database include information aboum major weather events as well as estimates of any fatalities, injuries and property damage caused by these events. Database contains events from 1950 till now.

Firs of all we have to download raw data and available documentation that comes with the data.

url <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
url1 <- "https://d396qusza40orc.cloudfront.net/repdata%2Fpeer2_doc%2Fpd01016005curr.pdf"
url2 <- "https://d396qusza40orc.cloudfront.net/repdata%2Fpeer2_doc%2FNCDC%20Storm%20Events-FAQ%20Page.pdf"

download.file(url = url, destfile = "Stormdata.csv.bz2")
download.file(url = url1, destfile = "Documentation.pdf", mode = "wb")
download.file(url = url2, destfile = "FAQ.pdf", mode = "wb")

First of all we read in the data from the downloaded stormdata.csv.bz2 file. File compressed and the data is comma-separated. First row of the data is column names, so we read it like a header.

stormdata <- read.table("Stormdata.csv.bz2", sep = ",", header = TRUE)
dim(stormdata)
## [1] 902297     37
str(stormdata$EVTYPE)
##  Factor w/ 985 levels "   HIGH SURF ADVISORY",..: 834 834 834 834 834 834 834 834 834 834 ...

The dataset has 37 columns and approximately 900 000 rows.

Event types in the downloaded dataframe looks quite messy. Moreover we have a lot of columns that is unnecessary for our future analysis. We have to clean it to perform future analysis. First af all we have to get rid of unnecessary columns ans select columns that have relation to our research.

From the structure of stormdata we can see that columns that can be interesting for us are:

BGN_DATE - date of the event

STATE - state where event happened

FATALITIES, INJURIES - number of death and injuries due to the event

PROPDMG, CROPDMG - damage caused by the event

library(dplyr)
stormdata <- select(stormdata, BGN_DATE, STATE, EVTYPE, FATALITIES, INJURIES, PROPDMG, CROPDMG)
head(stormdata)
##             BGN_DATE STATE  EVTYPE FATALITIES INJURIES PROPDMG CROPDMG
## 1  4/18/1950 0:00:00    AL TORNADO          0       15    25.0       0
## 2  4/18/1950 0:00:00    AL TORNADO          0        0     2.5       0
## 3  2/20/1951 0:00:00    AL TORNADO          0        2    25.0       0
## 4   6/8/1951 0:00:00    AL TORNADO          0        2     2.5       0
## 5 11/15/1951 0:00:00    AL TORNADO          0        2     2.5       0
## 6 11/15/1951 0:00:00    AL TORNADO          0        6     2.5       0

Now we need list of events to compare our EVTYPE variable, because it is highly unlikely that there are 985 event types. In the downloaded documentation we have list of events on page 6 and there we can see only 48 types of events. First of all we need to extract information about events tome from this pdf document.

library(pdftools)
text <- pdf_text("Documentation.pdf")
text2 <- strsplit(text, "\n")
Events <- text2[[6]]
Events <- Events[9:32]
Events <- gsub("(  )+", "\n", Events)
tmp <- strsplit(Events, "\n")
tmp1 <- sapply(tmp, '[[', 1)
tmp2 <- sapply(tmp, '[[', 3)
Events <- c(tmp1, tmp2)
Events
##  [1] "Astronomical Low Tide"    "Avalanche"               
##  [3] "Blizzard"                 "Coastal Flood"           
##  [5] "Cold/Wind Chill"          "Debris Flow"             
##  [7] "Dense Fog"                "Dense Smoke"             
##  [9] "Drought"                  "Dust Devil"              
## [11] "Dust Storm"               "Excessive Heat"          
## [13] "Extreme Cold/Wind Chill"  "Flash Flood"             
## [15] "Flood"                    "Frost/Freeze"            
## [17] "Funnel Cloud"             "Freezing Fog"            
## [19] "Hail"                     "Heat"                    
## [21] "Heavy Rain"               "Heavy Snow"              
## [23] "High Surf"                "High Wind"               
## [25] "Hurricane (Typhoon)"      "Ice Storm"               
## [27] "Lake-Effect Snow"         "Lakeshore Flood"         
## [29] "Lightning"                "Marine Hail"             
## [31] "Marine High Wind"         "Marine Strong Wind"      
## [33] "Marine Thunderstorm Wind" "Rip Current"             
## [35] "Seiche"                   "Sleet"                   
## [37] "Storm Surge/Tide"         "Strong Wind"             
## [39] "Thunderstorm Wind"        "Tornado"                 
## [41] "Tropical Depression"      "Tropical Storm"          
## [43] "Tsunami"                  "Volcanic Ash"            
## [45] "Waterspout"               "Wildfire"                
## [47] "Winter Storm"             "Winter Weather"

Now we have list of correct event names. Let’s look how many events in our dataframe look like events in the documentation.

sum(stormdata$EVTYPE %in% Events)/nrow(stormdata)
## [1] 8.090462e-05

Less than 1%. It can be caused by the difference in capital and lower letters. Let’s make all event types in our dataframe and in our list of events lowercase.

stormdata$EVTYPE <- as.character(stormdata$EVTYPE)
stormdata$EVTYPE <- tolower(stormdata$EVTYPE)
Events <- tolower(Events)
sum(stormdata$EVTYPE %in% Events)/nrow(stormdata)
## [1] 0.7041484

70% of correct event types. Now it is much better, howewer in order to avoid big mistakes in our analysis we have to figure out what this 30% represent. If we look at the list of event types in our database we can see word “tstm” which is abbreviation of thunderstorm. Lets count the number of rows with this abbreviation.

length(grep("tstm", stormdata$EVTYPE))
## [1] 227236

Almost 230 000 rows - thats big part of the data. Now we change this abbreviation to full word thunderstorm.

stormdata$EVTYPE <- sub("tstm", "thunderstorm", stormdata$EVTYPE)
sum(stormdata$EVTYPE %in% Events)/nrow(stormdata)
## [1] 0.9547499

Now 95% of event types are in list of event types. Now we can perform our analysis and look at the results. However it can be possible that these 5% or poorly documented event types responsible for majority of death and injuries. Lets separate our dataset into 2 sets. One with proper event labeling and second with rest of poorely documented events and calculate total harm to human health in both sets.

stormdata.correct <- filter(stormdata, stormdata$EVTYPE %in% Events)
stormdata.messy <- filter(stormdata, !(stormdata$EVTYPE %in% Events))

After separation we can group our datasets by event type and calculate sum of fatalities and injuries across the US for every event type

by.country.correct <- stormdata.correct %>%
        group_by(EVTYPE) %>%
        summarise(Death = sum(FATALITIES), Inj = sum(INJURIES)) %>%
        filter(Death > quantile(Death, 0.7) | Inj > quantile(Inj, 0.7)) %>%
        arrange(desc(Death)) %>%
        print
## # A tibble: 18 x 3
##    EVTYPE                  Death   Inj
##    <chr>                   <dbl> <dbl>
##  1 tornado                  5633 91346
##  2 excessive heat           1903  6525
##  3 flash flood               978  1777
##  4 heat                      937  2100
##  5 lightning                 816  5230
##  6 thunderstorm wind         637  8445
##  7 flood                     470  6789
##  8 rip current               368   232
##  9 high wind                 248  1137
## 10 avalanche                 224   170
## 11 winter storm              206  1321
## 12 heavy snow                127  1021
## 13 extreme cold/wind chill   125    24
## 14 high surf                 104   156
## 15 blizzard                  101   805
## 16 ice storm                  89  1975
## 17 wildfire                   75   911
## 18 hail                       15  1361
by.country.messy <- stormdata.messy %>%
    group_by(EVTYPE) %>%
    summarise(Death = sum(FATALITIES), Inj = sum(INJURIES)) %>%
    filter(Death > quantile(Death, 0.7) | Inj > quantile(Inj, 0.7)) %>%
    arrange(desc(Death))%>%
    print
## # A tibble: 166 x 3
##    EVTYPE               Death   Inj
##    <chr>                <dbl> <dbl>
##  1 rip currents           204   297
##  2 heat wave              172   379
##  3 extreme cold           162   231
##  4 extreme heat            96   155
##  5 hurricane/typhoon       64  1275
##  6 thunderstorm winds      64   908
##  7 fog                     62   734
##  8 hurricane               61    46
##  9 heavy surf/high surf    42    48
## 10 cold                    38    48
## # ... with 156 more rows

Now we can see that in the dataset with messy event types the most harmful event is “rip currents”" that should be labelled as “rip current” (as we can see from list of correct event types). Looks like second harmful “heat wave” should be just “heat”, “extreme heat” should be “excessive heat”, “thunderstorm winds” should be “thunderstorm wind”, “heavy surf/high surf” should be “high surf” “high winds” should be “high wind”

“fog” should be “debse fog”and “hurricane” should be “hurricane (typhoon)”. For the last two event type we cant use sub command because it will substitute separate words fog and hurricane in correct event types and we can clearly see that words fog and hurricane can be related to one event only (for for dense for and hurricane for hurricane (typhoon)). “extreme cold” should be “extreme cold/wind chill” but just one word “cold” can be related to both “extreme cold/wind chill” and “Cold/Wind Chill”. We will leave it in messy data.

stormdata$EVTYPE <- sub("rip currents", "rip current", stormdata$EVTYPE)
stormdata$EVTYPE <- sub("heat wave", "heat", stormdata$EVTYPE)
stormdata$EVTYPE <- sub("extreme heat", "excessive heat", stormdata$EVTYPE)
stormdata$EVTYPE <- sub("thunderstorm winds", "thunderstorm wind", stormdata$EVTYPE)
stormdata$EVTYPE <- sub("heavy surf/high surf", "high surf", stormdata$EVTYPE)
stormdata$EVTYPE <- sub("high winds", "high wind", stormdata$EVTYPE)

stormdata[grep("fog", stormdata$EVTYPE), 3] <- "dense fog"
stormdata[grep("hurricane", stormdata$EVTYPE), 3] <- "hurricane (typhoon)"
stormdata[grep("extreme cold", stormdata$EVTYPE), 3] <- "extreme cold/wind chill"
sum(stormdata$EVTYPE %in% Events)/nrow(stormdata)
## [1] 0.9819029

After substitutions our dataset contains only 2% of messy data. Lets separate our initial dataset again and look at the summary of correct and messy datasets

stormdata.correct <- filter(stormdata, stormdata$EVTYPE %in% Events)
stormdata.messy <- filter(stormdata, !(stormdata$EVTYPE %in% Events))
by.country.correct <- stormdata.correct %>%
        group_by(EVTYPE) %>%
        summarise(Death = sum(FATALITIES), Inj = sum(INJURIES)) %>%
        arrange(desc(Death)) %>%
        print
## # A tibble: 46 x 3
##    EVTYPE                  Death   Inj
##    <chr>                   <dbl> <dbl>
##  1 tornado                  5633 91346
##  2 excessive heat           1999  6680
##  3 heat                     1109  2479
##  4 flash flood               978  1777
##  5 lightning                 816  5230
##  6 thunderstorm wind         701  9353
##  7 rip current               572   529
##  8 flood                     470  6789
##  9 extreme cold/wind chill   287   255
## 10 high wind                 283  1439
## # ... with 36 more rows
by.country.messy <- stormdata.messy %>%
    group_by(EVTYPE) %>%
    summarise(Death = sum(FATALITIES), Inj = sum(INJURIES)) %>%
    filter(Death > quantile(Death, 0.7) | Inj > quantile(Inj, 0.7)) %>%
    arrange(desc(Death)) %>%
    print
## # A tibble: 147 x 3
##    EVTYPE                             Death   Inj
##    <chr>                              <dbl> <dbl>
##  1 cold                                  38    48
##  2 landslide                             38    52
##  3 unseasonably warm and dry             29     0
##  4 urban/sml stream fld                  28    79
##  5 winter weather/mix                    28    72
##  6 tornadoes, thunderstorm wind, hail    25     0
##  7 wind                                  23    86
##  8 flash flooding                        19     8
##  9 extreme windchill                     17     5
## 10 flood/flash flood                     17    15
## # ... with 137 more rows

Now our summary for messy data looks like we can make more substitution but it represents little bit less than 2% of data so we can just delete it for our analysis.

Results

In the following analysis we will use only data with correct event labels. It’s obvious that different states have different the most dangerous weather events. So to understand what is the most dangerous evnet in each state it’s worth to summarise storm data by states and show only event types that caused majority of fatalities.

Entire US analysis

by.country.correct <- stormdata.correct %>%
        group_by(EVTYPE) %>%
        summarise(Death = sum(FATALITIES), Inj = sum(INJURIES)) %>%
        arrange(desc(Death)) %>%
        print
## # A tibble: 46 x 3
##    EVTYPE                  Death   Inj
##    <chr>                   <dbl> <dbl>
##  1 tornado                  5633 91346
##  2 excessive heat           1999  6680
##  3 heat                     1109  2479
##  4 flash flood               978  1777
##  5 lightning                 816  5230
##  6 thunderstorm wind         701  9353
##  7 rip current               572   529
##  8 flood                     470  6789
##  9 extreme cold/wind chill   287   255
## 10 high wind                 283  1439
## # ... with 36 more rows

The plot for top 20 events that caused most damage to human health.

library(ggplot2)

g <- ggplot(by.country.correct[1:20,], aes(x = factor(EVTYPE), y = Death))
g + geom_histogram(stat = "identity", width = 0.5) + coord_flip() + labs(x = "Weather Event", y = "Number of fatalities")
## Warning: Ignoring unknown parameters: binwidth, bins, pad

by.country.correct <- stormdata.correct %>%
        group_by(EVTYPE) %>%
        summarise(Damage.Property = sum(PROPDMG), Damage.Crop = sum(CROPDMG)) %>%
        filter(Damage.Property >0 | Damage.Crop >0) %>%
        arrange(desc(Damage.Property), desc(Damage.Crop)) %>%
        print
## # A tibble: 45 x 3
##    EVTYPE            Damage.Property Damage.Crop
##    <chr>                       <dbl>       <dbl>
##  1 tornado                  3212258.     100019.
##  2 thunderstorm wind        2659192.     194679.
##  3 flash flood              1420125.     179200.
##  4 flood                     899938.     168038.
##  5 hail                      688693.     579596.
##  6 lightning                 603352.       3581.
##  7 high wind                 380357.      19043.
##  8 winter storm              132721.       1979.
##  9 heavy snow                122252.       2166.
## 10 wildfire                   84459.       4364.
## # ... with 35 more rows

And the plot for top 20 events that caused most damage to property.

g <- ggplot(by.country.correct[1:20,], aes(x = factor(EVTYPE), y = Damage.Property))
g + geom_histogram(stat = "identity", width = 0.5) + coord_flip() + labs(x = "Weather Event", y = "Damage to property")
## Warning: Ignoring unknown parameters: binwidth, bins, pad

Every state in US analysis

Summary of the dataframe grouped by states.

by.states <- stormdata.correct %>%
        group_by(EVTYPE, STATE) %>%
        summarise(Death = sum(FATALITIES), Inj = sum(INJURIES)) %>%
        ungroup %>%
        group_by(STATE) %>%
        filter(Death == max(Death) & (Death > 0 | Inj > 0)) %>%
        arrange(desc(Death), desc(Inj)) %>%
        print(n = Inf)
## # A tibble: 65 x 4
## # Groups:   STATE [63]
##    EVTYPE                   STATE Death   Inj
##    <chr>                    <fct> <dbl> <dbl>
##  1 heat                     IL      653   242
##  2 tornado                  AL      617  7929
##  3 tornado                  TX      538  8207
##  4 tornado                  MS      450  6244
##  5 tornado                  MO      388  4330
##  6 tornado                  AR      379  5116
##  7 tornado                  TN      368  4748
##  8 excessive heat           PA      365   357
##  9 tornado                  OK      296  4829
## 10 rip current              FL      271   247
## 11 tornado                  IN      252  4224
## 12 tornado                  MI      243  3362
## 13 tornado                  KS      236  2721
## 14 tornado                  OH      191  4438
## 15 tornado                  GA      180  3926
## 16 tornado                  LA      153  2637
## 17 tornado                  NC      126  2536
## 18 tornado                  KY      125  2806
## 19 excessive heat           CA      110   260
## 20 tornado                  MA      108  1758
## 21 tornado                  MN       99  1976
## 22 tornado                  WI       96  1601
## 23 excessive heat           NY       93     1
## 24 excessive heat           MD       88   461
## 25 tornado                  IA       81  2208
## 26 flash flood              AZ       62   150
## 27 tornado                  SC       59  1314
## 28 tornado                  NE       54  1158
## 29 heat                     NV       54     0
## 30 lightning                CO       48   260
## 31 avalanche                CO       48    35
## 32 avalanche                UT       44    25
## 33 excessive heat           NJ       39   300
## 34 high wind                WA       38    75
## 35 rip current              GU       38    55
## 36 tornado                  VA       36   914
## 37 flash flood              PR       34     2
## 38 avalanche                AK       33    17
## 39 tsunami                  AS       32   129
## 40 high surf                HI       28    32
## 41 tornado                  ND       25   326
## 42 flash flood              WV       24    10
## 43 avalanche                WY       23    21
## 44 excessive heat           DC       20   316
## 45 high wind                OR       19    50
## 46 tornado                  SD       18   452
## 47 flash flood              NM       16    18
## 48 avalanche                ID       16     9
## 49 lightning                MT        9    30
## 50 marine thunderstorm wind AM        9    30
## 51 excessive heat           DE        7    51
## 52 thunderstorm wind        NH        7    33
## 53 marine thunderstorm wind AN        7     3
## 54 lightning                ME        6    70
## 55 high wind                CT        6    11
## 56 marine strong wind       PZ        5     3
## 57 flood                    VT        4     5
## 58 high surf                RI        3     0
## 59 high surf                VI        3     0
## 60 marine strong wind       LM        2     1
## 61 marine thunderstorm wind LM        2     1
## 62 marine strong wind       LS        1     0
## 63 marine strong wind       PH        1     0
## 64 marine thunderstorm wind GM        1     0
## 65 high surf                MH        0     1

Now we can clarly see that for a lot of the states tornado is the most harmful event type but for big part of the states another types of weather events cause majority of fatalities.

Now we can calculate damage to crops and property that was caused by different weather events.

by.states <- stormdata.correct %>%
        group_by(EVTYPE, STATE) %>%
        summarise(Damage.Property = sum(PROPDMG), Damage.Crop = sum(CROPDMG)) %>%
        ungroup %>%
        group_by(STATE) %>%
        filter(Damage.Property == max(Damage.Property) & (Damage.Property >0 | Damage.Crop >0)) %>%
        arrange(STATE) %>%
        print(n = Inf)
## # A tibble: 66 x 4
## # Groups:   STATE [66]
##    EVTYPE                   STATE Damage.Property Damage.Crop
##    <chr>                    <fct>           <dbl>       <dbl>
##  1 flood                    AK             10808.         8  
##  2 tornado                  AL            167816.      1653. 
##  3 waterspout               AM              5102          0  
##  4 marine thunderstorm wind AN               189          0  
##  5 tornado                  AR            119550.       388. 
##  6 hurricane (typhoon)      AS               610        602  
##  7 thunderstorm wind        AZ             42352        165  
##  8 high wind                CA             32545.      1483. 
##  9 tornado                  CO             18348.        21  
## 10 thunderstorm wind        CT              9457.         0  
## 11 thunderstorm wind        DC              2080.         0  
## 12 lightning                DE              5914          0  
## 13 tornado                  FL            159753.       148. 
## 14 tornado                  GA            151349.      3792. 
## 15 marine thunderstorm wind GM               395.         0  
## 16 tropical storm           GU              1779        241  
## 17 flash flood              HI              3162.       585  
## 18 thunderstorm wind        IA            154564.     22951. 
## 19 thunderstorm wind        ID             10562.        36  
## 20 tornado                  IL            128909.      1297. 
## 21 tornado                  IN            104686.       516  
## 22 tornado                  KS            143210.      5482. 
## 23 thunderstorm wind        KY             78842.      1472. 
## 24 tornado                  LA            131476.      2844  
## 25 marine thunderstorm wind LE                30          0  
## 26 marine thunderstorm wind LM              1210.         0  
## 27 marine thunderstorm wind LO                70          0  
## 28 marine thunderstorm wind LS               400          0  
## 29 flood                    MA             19044.         0  
## 30 thunderstorm wind        MD             21883.       158. 
## 31 flood                    ME             34350.         0  
## 32 high surf                MH                 5          0  
## 33 thunderstorm wind        MI             72839.      2293. 
## 34 tornado                  MN             74033.      4205. 
## 35 tornado                  MO            132160.      2286  
## 36 tornado                  MS            187841.     24964. 
## 37 thunderstorm wind        MT             16767.      5371. 
## 38 tornado                  NC             96280.      2039. 
## 39 tornado                  ND             49739.      4742  
## 40 tornado                  NE            110774.     14259. 
## 41 flood                    NH             18406.       200  
## 42 tornado                  NJ             14079.       200  
## 43 flash flood              NM             15792.      1384  
## 44 high wind                NV              8288.       100. 
## 45 thunderstorm wind        NY             85845.      1629  
## 46 thunderstorm wind        OH            165661.      1004  
## 47 tornado                  OK            165168.       607. 
## 48 high wind                OR              6072.         5  
## 49 thunderstorm wind        PA             67253.        30.5
## 50 marine high wind         PK                31          0  
## 51 flash flood              PR             16270.      1581  
## 52 marine strong wind       PZ                76          0  
## 53 thunderstorm wind        RI              2514.         0  
## 54 tornado                  SC             56730.       271. 
## 55 tornado                  SD             33512.       639. 
## 56 marine thunderstorm wind SL                15          0  
## 57 tornado                  TN            112161.       681  
## 58 tornado                  TX            283097.      4866. 
## 59 flash flood              UT              9271.       824. 
## 60 tornado                  VA             49360.       158  
## 61 flash flood              VI               561.         0  
## 62 flash flood              VT             19145.       739  
## 63 high wind                WA             20308.        93.2
## 64 tornado                  WI            111640.     11748. 
## 65 flash flood              WV             74637.      1250  
## 66 tornado                  WY              7657.         0