Synopsis

Storms and other severe weather events can cause both public health and economic problems for communities and municipalities. Many severe events can result in fatalities, injuries, and property damage, and preventing such outcomes to the extent possible is a key concern. This project involves exploring the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. This database tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage.

Data Processing

Downloading the data

library(R.utils)
library(lattice)
filename='StormData.csv.bz2'
content='StormData.csv'

if (!file.exists(filename) && !file.exists(content))
{
    download.file('https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2', filename, method='curl')
}

# Extracting the archive
if (file.exists(filename) && !file.exists(content))
{
    bunzip2(filename, content, remove = FALSE, skip = TRUE)
}
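
The extraction step is optional, because R file connections decompress bzip2 transparently when reading. A minimal sketch on a small made-up file (the temporary file and its two rows are illustrative, not part of the storm data):

```r
# Write a tiny bzip2-compressed CSV to a temporary file (illustrative data)
tmp <- tempfile(fileext = ".csv.bz2")
con <- bzfile(tmp, open = "wt")
writeLines(c("EVTYPE,FATALITIES", "TORNADO,0", "HAIL,1"), con)
close(con)

# read.csv() opens the compressed file directly; no bunzip2() call is needed
small <- read.csv(tmp)
print(small)
```

Reading the archive directly saves disk space, though keeping an extracted copy can make repeated loads faster.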

Loading the data in a dataframe

Now, let’s have a first look at the data.

df1 <- read.csv(content)
head(df1)
##   STATE__           BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME STATE  EVTYPE
## 1       1  4/18/1950 0:00:00     0130       CST     97     MOBILE    AL TORNADO
## 2       1  4/18/1950 0:00:00     0145       CST      3    BALDWIN    AL TORNADO
## 3       1  2/20/1951 0:00:00     1600       CST     57    FAYETTE    AL TORNADO
## 4       1   6/8/1951 0:00:00     0900       CST     89    MADISON    AL TORNADO
## 5       1 11/15/1951 0:00:00     1500       CST     43    CULLMAN    AL TORNADO
## 6       1 11/15/1951 0:00:00     2000       CST     77 LAUDERDALE    AL TORNADO
##   BGN_RANGE BGN_AZI BGN_LOCATI END_DATE END_TIME COUNTY_END COUNTYENDN
## 1         0                                               0         NA
## 2         0                                               0         NA
## 3         0                                               0         NA
## 4         0                                               0         NA
## 5         0                                               0         NA
## 6         0                                               0         NA
##   END_RANGE END_AZI END_LOCATI LENGTH WIDTH F MAG FATALITIES INJURIES PROPDMG
## 1         0                      14.0   100 3   0          0       15    25.0
## 2         0                       2.0   150 2   0          0        0     2.5
## 3         0                       0.1   123 2   0          0        2    25.0
## 4         0                       0.0   100 2   0          0        2     2.5
## 5         0                       0.0   150 2   0          0        2     2.5
## 6         0                       1.5   177 2   0          0        6     2.5
##   PROPDMGEXP CROPDMG CROPDMGEXP WFO STATEOFFIC ZONENAMES LATITUDE LONGITUDE
## 1          K       0                                         3040      8812
## 2          K       0                                         3042      8755
## 3          K       0                                         3340      8742
## 4          K       0                                         3458      8626
## 5          K       0                                         3412      8642
## 6          K       0                                         3450      8748
##   LATITUDE_E LONGITUDE_ REMARKS REFNUM
## 1       3051       8806              1
## 2          0          0              2
## 3          0          0              3
## 4          0          0              4
## 5          0          0              5
## 6          0          0              6
names(df1)
##  [1] "STATE__"    "BGN_DATE"   "BGN_TIME"   "TIME_ZONE"  "COUNTY"    
##  [6] "COUNTYNAME" "STATE"      "EVTYPE"     "BGN_RANGE"  "BGN_AZI"   
## [11] "BGN_LOCATI" "END_DATE"   "END_TIME"   "COUNTY_END" "COUNTYENDN"
## [16] "END_RANGE"  "END_AZI"    "END_LOCATI" "LENGTH"     "WIDTH"     
## [21] "F"          "MAG"        "FATALITIES" "INJURIES"   "PROPDMG"   
## [26] "PROPDMGEXP" "CROPDMG"    "CROPDMGEXP" "WFO"        "STATEOFFIC"
## [31] "ZONENAMES"  "LATITUDE"   "LONGITUDE"  "LATITUDE_E" "LONGITUDE_"
## [36] "REMARKS"    "REFNUM"
df1[1000,]
##      STATE__          BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME STATE
## 1000       1 1/13/1972 0:00:00     0215       CST     67      HENRY    AL
##       EVTYPE BGN_RANGE BGN_AZI BGN_LOCATI END_DATE END_TIME COUNTY_END
## 1000 TORNADO         0                                               0
##      COUNTYENDN END_RANGE END_AZI END_LOCATI LENGTH WIDTH F MAG FATALITIES
## 1000         NA         0                       8.4   200 3   0          0
##      INJURIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP WFO STATEOFFIC ZONENAMES
## 1000        2     250          K       0                                    
##      LATITUDE LONGITUDE LATITUDE_E LONGITUDE_ REMARKS REFNUM
## 1000     3136      8524       3143       8522           1000

Dealing with the missing values

Checking how many rows have missing values.

Rows with missing fatalities:

# FATALITIES is numeric, so only NA can mark a missing value
sum(is.na(df1$FATALITIES))
## [1] 0

Rows with missing injuries:

# INJURIES is numeric, so only NA can mark a missing value
sum(is.na(df1$INJURIES))
## [1] 0

As we can see, neither the FATALITIES nor the INJURIES column contains missing values.
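
The same check can be run over several columns at once. A minimal sketch on a made-up dataframe (`toy` and its values are hypothetical stand-ins for df1):

```r
# Toy dataframe standing in for df1 (values are made up)
toy <- data.frame(
    FATALITIES = c(0, 1, NA),
    INJURIES   = c(2, 0, 0),
    PROPDMG    = c(25, NA, NA)
)

# Count missing values per column in one pass
na_counts <- colSums(is.na(toy))
print(na_counts)
```

On the real data, the column list could be extended to PROPDMG and CROPDMG, which are used later in the analysis.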

Grouping similar event types together

Let’s have a look at all the types of events that this dataset contains.

unique(df1$EVTYPE)
# Results are hidden because of the huge size

The output shows that some event types are recorded under several different labels, so the same kind of event appears as distinct types in this dataframe. A few examples:

  1. “RIP CURRENT” and “RIP CURRENTS”
  2. “WILDFIRE” and “WILD/FOREST FIRE”
  3. “HIGH SURF” and “HEAVY SURF/HIGH SURF”

Now let’s fix these by grouping these events together under one event type.

# Replace each variant label with its canonical event type
df1$EVTYPE[df1$EVTYPE == 'RIP CURRENTS'] <- 'RIP CURRENT'
df1$EVTYPE[df1$EVTYPE == 'WILD/FOREST FIRE'] <- 'WILDFIRE'
df1$EVTYPE[df1$EVTYPE == 'HEAVY SURF/HIGH SURF'] <- 'HIGH SURF'
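
Only three pairs are merged here. A broader cleanup could first normalise case and surrounding whitespace and then apply a lookup table. The sketch below is one possible approach, not the method used in this report; the sample labels and the `canonical` lookup are hypothetical, and whether such variants occur in the data is an assumption:

```r
# Toy labels (made up) showing case, whitespace and spelling variants
evtype <- c("  rip currents", "RIP CURRENT", "Wild/Forest Fire", "HAIL ")

# Step 1: trim surrounding whitespace and upper-case everything
clean <- toupper(trimws(evtype))

# Step 2: map known variants to one canonical label (hypothetical lookup)
canonical <- c("RIP CURRENTS"         = "RIP CURRENT",
               "WILD/FOREST FIRE"     = "WILDFIRE",
               "HEAVY SURF/HIGH SURF" = "HIGH SURF")
hit <- clean %in% names(canonical)
clean[hit] <- unname(canonical[clean[hit]])
print(clean)
```

A lookup vector keeps all the merging rules in one place, so adding another pair means adding one entry rather than one more assignment.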

Exploratory Data Analysis

Grouping events by fatalities and injuries

Now let’s create a new dataframe that sums FATALITIES and INJURIES for each EVTYPE, which gives a better picture of the total impact of each event type.

df2 <- aggregate(list(Fatalities=df1$FATALITIES, Injuries=df1$INJURIES), by=list(Event=df1$EVTYPE), FUN=sum)
head(df2)
##                   Event Fatalities Injuries
## 1    HIGH SURF ADVISORY          0        0
## 2         COASTAL FLOOD          0        0
## 3           FLASH FLOOD          0        0
## 4             LIGHTNING          0        0
## 5             TSTM WIND          0        0
## 6       TSTM WIND (G45)          0        0

Now that we have the total impact of each event type on the population, we can sort the dataframe by the number of fatalities and then, to break ties, by the number of injuries.

df3 <- df2[order(-df2$Fatalities, -df2$Injuries),]

# Events with highest fatalities
df3[1:20,]
##                       Event Fatalities Injuries
## 832                 TORNADO       5633    91346
## 130          EXCESSIVE HEAT       1903     6525
## 153             FLASH FLOOD        978     1777
## 275                    HEAT        937     2100
## 463               LIGHTNING        816     5230
## 584             RIP CURRENT        572      529
## 854               TSTM WIND        504     6957
## 170                   FLOOD        470     6789
## 358               HIGH WIND        248     1137
## 19                AVALANCHE        224      170
## 969            WINTER STORM        206     1321
## 278               HEAT WAVE        172      309
## 140            EXTREME COLD        160      231
## 349               HIGH SURF        143      200
## 758       THUNDERSTORM WIND        133     1488
## 310              HEAVY SNOW        127     1021
## 141 EXTREME COLD/WIND CHILL        125       24
## 674             STRONG WIND        103      280
## 30                 BLIZZARD        101      805
## 290              HEAVY RAIN         98      251

Now let’s draw a bar plot for the ten events with the highest fatalities.

plot1 <- barchart(Fatalities + Injuries ~ Event, data = df3[1:10,],
    key = list(
        space = "right",
        text = list(c("Fatalities", "Injuries"), col = 'black'),
        rectangles = list(col = c("dodgerblue", "salmon"))
    ),
    main = "Fatalities and Injuries per event", xlab = "Event", ylab = "Count",
    scales = list(x = list(rot = 45)),
    col = c("dodgerblue", "salmon")
)

print(plot1)
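
By default the bars are laid out in the order of the factor levels of Event, which for plain text labels is alphabetical rather than the sorted order of the table. A minimal sketch of one way to keep the table order, using three rows copied from the table above (`top` and `ordered_plot` are illustrative names):

```r
library(lattice)

# Stand-in for the first rows of df3; values copied from the table above
top <- data.frame(
    Event      = c("TORNADO", "EXCESSIVE HEAT", "FLASH FLOOD"),
    Fatalities = c(5633, 1903, 978),
    Injuries   = c(91346, 6525, 1777)
)

# Fix the level order so the bars follow the row order, not the alphabet
top$Event <- factor(top$Event, levels = top$Event)

ordered_plot <- barchart(Fatalities + Injuries ~ Event, data = top,
    auto.key = list(space = "right"),
    scales = list(x = list(rot = 45)),
    xlab = "Event", ylab = "Count"
)
```

Calling print(ordered_plot) draws the chart with TORNADO first, followed by EXCESSIVE HEAT and FLASH FLOOD.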

Events that cause the highest impact per event

First, we assume that a fatality is more impactful than an injury. In this analysis, each fatality is weighted twice as heavily as an injury.

Now let’s create a new column for calculating total impact per event.

# Create a new column for number of total events
df3$Total_Events <- sapply(df3$Event, function(event_s){
    nrow(df1[(df1$EVTYPE == event_s),])
})


df3$Impact_per_Event <- (df3$Fatalities * 2 + df3$Injuries) / df3$Total_Events
head(df3, 10)
##              Event Fatalities Injuries Total_Events Impact_per_Event
## 832        TORNADO       5633    91346        60652       1.69181560
## 130 EXCESSIVE HEAT       1903     6525         1678       6.15673421
## 153    FLASH FLOOD        978     1777        54277       0.06877683
## 275           HEAT        937     2100          767       5.18122555
## 463      LIGHTNING        816     5230        15754       0.43557192
## 584    RIP CURRENT        572      529          774       2.16149871
## 854      TSTM WIND        504     6957       219940       0.03621442
## 170          FLOOD        470     6789        25326       0.30518045
## 358      HIGH WIND        248     1137        20212       0.08079359
## 19       AVALANCHE        224      170          386       1.60103627
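
As a check on the formula, the EXCESSIVE HEAT row gives (2 × 1903 + 6525) / 1678 ≈ 6.157, which matches the table. The sketch below reproduces the same calculation on a made-up event log and counts events with table() in a single pass instead of filtering the dataframe once per event type; the `events` data is hypothetical:

```r
# Toy event log (made up): one row per recorded event
events <- data.frame(
    EVTYPE     = c("HEAT", "HEAT", "HAIL", "HAIL", "HAIL", "HAIL"),
    FATALITIES = c(2, 1, 0, 0, 0, 1),
    INJURIES   = c(4, 0, 1, 0, 0, 0)
)

# Sum casualties per event type
totals <- aggregate(cbind(FATALITIES, INJURIES) ~ EVTYPE, data = events, FUN = sum)

# Count rows per event type once, then look each type up by name
counts <- table(events$EVTYPE)
totals$Total_Events <- as.vector(counts[as.character(totals$EVTYPE)])

# Weighted score: each fatality counts twice as much as an injury
totals$Impact_per_Event <-
    (2 * totals$FATALITIES + totals$INJURIES) / totals$Total_Events
print(totals)
```

On a dataset with hundreds of event types, a single table() call is usually much faster than one full scan of the data per type.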

Now let’s sort the dataframe by the Impact_per_Event column. We also assume that an event type with 100 or fewer recorded events has too little data for a reliable per-event estimate, so those types are excluded.

# Sorting data in the order of highest impact per event
df3 <- df3[(order(-df3$Impact_per_Event)),]

# Number of events should be higher than 100
df4 <- df3[(df3$Total_Events > 100),]

head(df4, 20)
##                 Event Fatalities Injuries Total_Events Impact_per_Event
## 130    EXCESSIVE HEAT       1903     6525         1678        6.1567342
## 275              HEAT        937     2100          767        5.1812256
## 584       RIP CURRENT        572      529          774        2.1614987
## 832           TORNADO       5633    91346        60652        1.6918156
## 19          AVALANCHE        224      170          386        1.6010363
## 188               FOG         62      734          538        1.5947955
## 117        DUST STORM         22      440          427        1.1334895
## 426         ICE STORM         89     1975         2006        1.0732802
## 401         HURRICANE         61       46          174        0.9655172
## 140      EXTREME COLD        160      231          655        0.8412214
## 846    TROPICAL STORM         58      340          690        0.6608696
## 349         HIGH SURF        143      200          953        0.5099685
## 463         LIGHTNING        816     5230        15754        0.4355719
## 957              WIND         23       86          340        0.3882353
## 954          WILDFIRE         87     1456         4218        0.3864391
## 79    COLD/WIND CHILL         95       12          539        0.3747681
## 30           BLIZZARD        101      805         2719        0.3703567
## 115        DUST DEVIL          2       42          141        0.3262411
## 884 UNSEASONABLY WARM         11       17          126        0.3095238
## 170             FLOOD        470     6789        25326        0.3051804

Barplot of impact per event

Now let’s draw a barplot of events which cause the highest impact.

plot2 <- barchart(Impact_per_Event ~ Event, data = df4[1:10,],
    main = "Impact per event", xlab = "Event", ylab = "Impact per event",
    scales = list(x = list(rot = 45)),
    col = "dodgerblue"
)

print(plot2)

Economic Damage

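Now let’s create a new dataframe that holds the total economic damage per event type. PROPDMG and CROPDMG are summed here as recorded, without applying the magnitude codes in PROPDMGEXP and CROPDMGEXP, so the totals are raw figures rather than dollar amounts.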

df5 <- aggregate(list(Total_Economic_Damage=df1$PROPDMG + df1$CROPDMG), by=list(Event=df1$EVTYPE), FUN=sum)
head(df5)
##                   Event Total_Economic_Damage
## 1    HIGH SURF ADVISORY                   200
## 2         COASTAL FLOOD                     0
## 3           FLASH FLOOD                    50
## 4             LIGHTNING                     0
## 5             TSTM WIND                   108
## 6       TSTM WIND (G45)                     8
# Now sorting the dataframe by economic damage
df5 <- df5[(order(-df5$Total_Economic_Damage)),]
head(df5)
##                 Event Total_Economic_Damage
## 832           TORNADO             3312276.7
## 153       FLASH FLOOD             1599325.1
## 854         TSTM WIND             1445168.2
## 244              HAIL             1268289.7
## 170             FLOOD             1067976.4
## 758 THUNDERSTORM WIND              943635.6
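
To express damage in dollars, each value would need to be scaled by its magnitude code first. The sketch below shows one way to do that on made-up rows; it is not part of the analysis above. K, M and B stand for thousands, millions and billions, while treating every other code as a multiplier of 1 is an assumption of this sketch:

```r
# Toy rows (made up) in the same layout as PROPDMG / PROPDMGEXP
dmg <- data.frame(
    EVTYPE     = c("TORNADO", "TORNADO", "HURRICANE", "HAIL"),
    PROPDMG    = c(25, 2.5, 1.2, 10),
    PROPDMGEXP = c("K", "M", "B", "")
)

# Multipliers for the letter codes; any other code falls back to 1,
# which is an assumption of this sketch
multiplier <- c(K = 1e3, M = 1e6, B = 1e9)
mult <- multiplier[toupper(as.character(dmg$PROPDMGEXP))]
mult[is.na(mult)] <- 1

# Scale each value, then total the dollar amounts per event type
dmg$PROPDMG_USD <- dmg$PROPDMG * unname(mult)
usd_by_event <- aggregate(PROPDMG_USD ~ EVTYPE, data = dmg, FUN = sum)
print(usd_by_event)
```

The same mapping could be applied to CROPDMG and CROPDMGEXP before adding the two damage columns together.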

Economic damage per event by event type

Now we’ll see which event types cause the highest average economic damage per recorded event.

df5$Total_Events <- sapply(df5$Event, function(event_s){
    nrow(df1[(df1$EVTYPE == event_s),])
})

df5$Economic_Impact <- df5$Total_Economic_Damage / df5$Total_Events

# Now sort the dataframe by economic damage per event
df5 <- df5[(order(-df5$Economic_Impact)),]

# Number of events must be greater than 100
df5 <- df5[(df5$Total_Events > 100),]
head(df5, 10)
##                Event Total_Economic_Damage Total_Events Economic_Impact
## 401        HURRICANE              20852.99          174       119.84477
## 588      RIVER FLOOD              17345.70          173       100.26416
## 846   TROPICAL STORM              54322.80          690        78.72870
## 668      STORM SURGE              19398.49          261        74.32372
## 185         FLOODING               8824.90          120        73.54083
## 903      URBAN FLOOD              14216.50          249        57.09438
## 832          TORNADO            3312276.68        60652        54.61117
## 669 STORM SURGE/TIDE               7627.05          148        51.53412
## 164   FLASH FLOODING              33623.20          682        49.30088
## 170            FLOOD            1067976.36        25326        42.16917

Now let’s draw a barplot for economic damage per event type.

plot3 <- barchart(Economic_Impact ~ Event, data = df5[1:10,],
    main = "Average Economic Impact per event", xlab = "Event", ylab = "Economic Impact per event",
    scales = list(x = list(rot = 45)),
    col = "dodgerblue"
)

print(plot3)

Results

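  1. In terms of impact per event, excessive heat is the most harmful event type to the population among types with more than 100 recorded events.
  2. In terms of total fatalities and injuries, tornadoes are the most harmful.
  3. Hurricanes have the highest economic damage per event, measured as the raw sum of PROPDMG and CROPDMG without the magnitude codes applied, among types with more than 100 recorded events.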