Synopsis

In this dataset, we’ll explore records of major storm data for the past 60 years (1950 - 2011) in United States to understand its impact to human, in general. The dataset that is used came from U.S. National Oceanic and Atmospheric Administration’s (NOAA). You can download this dataset here

US Storm Data

Generally, we’ll explore fatalaties, injuries and property damages caused by major storms and weather events.

Data Processing

R is used as the tool to process this dataset.

library(R.utils)
library(dplyr)

Obtaining the File and Preprocessing

  1. The file is downloaded from the source link.
  2. The file will be unzip and placed into the working directory.

Alternatively, data can be loaded in R using the following scripts

# Download file from URL
if (!file.exists("c:/coursera/storm.bz2")) {
    download.file("https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2", "c:/coursera/storm.bz2")
}

# Unzip file
if (!file.exists("c:/coursera/storm.csv")) {
    bunzip2("c:/coursera/storm.csv.bz2", "c:/coursera/storm.csv", remove = FALSE)
}

Loading Data

Load data into a data frame.

# Load data into R
data <- read.csv("c:/coursera/storm.csv")

Review Data

Take a look at the structure of the dataset.

# Take a look at the data
str(data)
## 'data.frame':    902297 obs. of  37 variables:
##  $ STATE__   : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ BGN_DATE  : Factor w/ 16335 levels "1/1/1966 0:00:00",..: 6523 6523 4242 11116 2224 2224 2260 383 3980 3980 ...
##  $ BGN_TIME  : Factor w/ 3608 levels "00:00:00 AM",..: 272 287 2705 1683 2584 3186 242 1683 3186 3186 ...
##  $ TIME_ZONE : Factor w/ 22 levels "ADT","AKS","AST",..: 7 7 7 7 7 7 7 7 7 7 ...
##  $ COUNTY    : num  97 3 57 89 43 77 9 123 125 57 ...
##  $ COUNTYNAME: Factor w/ 29601 levels "","5NM E OF MACKINAC BRIDGE TO PRESQUE ISLE LT MI",..: 13513 1873 4598 10592 4372 10094 1973 23873 24418 4598 ...
##  $ STATE     : Factor w/ 72 levels "AK","AL","AM",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ EVTYPE    : Factor w/ 985 levels "   HIGH SURF ADVISORY",..: 834 834 834 834 834 834 834 834 834 834 ...
##  $ BGN_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ BGN_AZI   : Factor w/ 35 levels "","  N"," NW",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ BGN_LOCATI: Factor w/ 54429 levels "","- 1 N Albion",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ END_DATE  : Factor w/ 6663 levels "","1/1/1993 0:00:00",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ END_TIME  : Factor w/ 3647 levels ""," 0900CST",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ COUNTY_END: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ COUNTYENDN: logi  NA NA NA NA NA NA ...
##  $ END_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ END_AZI   : Factor w/ 24 levels "","E","ENE","ESE",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ END_LOCATI: Factor w/ 34506 levels "","- .5 NNW",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ LENGTH    : num  14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
##  $ WIDTH     : num  100 150 123 100 150 177 33 33 100 100 ...
##  $ F         : int  3 2 2 2 2 2 2 1 3 3 ...
##  $ MAG       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ FATALITIES: num  0 0 0 0 0 0 0 0 1 0 ...
##  $ INJURIES  : num  15 0 2 2 2 6 1 0 14 0 ...
##  $ PROPDMG   : num  25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
##  $ PROPDMGEXP: Factor w/ 19 levels "","-","?","+",..: 17 17 17 17 17 17 17 17 17 17 ...
##  $ CROPDMG   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CROPDMGEXP: Factor w/ 9 levels "","?","0","2",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ WFO       : Factor w/ 542 levels ""," CI","$AC",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ STATEOFFIC: Factor w/ 250 levels "","ALABAMA, Central",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ ZONENAMES : Factor w/ 25112 levels "","                                                                                                                               "| __truncated__,..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ LATITUDE  : num  3040 3042 3340 3458 3412 ...
##  $ LONGITUDE : num  8812 8755 8742 8626 8642 ...
##  $ LATITUDE_E: num  3051 0 0 0 0 ...
##  $ LONGITUDE_: num  8806 0 0 0 0 ...
##  $ REMARKS   : Factor w/ 436774 levels "","-2 at Deer Park\n",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ REFNUM    : num  1 2 3 4 5 6 7 8 9 10 ...
head(data)
##   STATE__           BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME STATE
## 1       1  4/18/1950 0:00:00     0130       CST     97     MOBILE    AL
## 2       1  4/18/1950 0:00:00     0145       CST      3    BALDWIN    AL
## 3       1  2/20/1951 0:00:00     1600       CST     57    FAYETTE    AL
## 4       1   6/8/1951 0:00:00     0900       CST     89    MADISON    AL
## 5       1 11/15/1951 0:00:00     1500       CST     43    CULLMAN    AL
## 6       1 11/15/1951 0:00:00     2000       CST     77 LAUDERDALE    AL
##    EVTYPE BGN_RANGE BGN_AZI BGN_LOCATI END_DATE END_TIME COUNTY_END
## 1 TORNADO         0                                               0
## 2 TORNADO         0                                               0
## 3 TORNADO         0                                               0
## 4 TORNADO         0                                               0
## 5 TORNADO         0                                               0
## 6 TORNADO         0                                               0
##   COUNTYENDN END_RANGE END_AZI END_LOCATI LENGTH WIDTH F MAG FATALITIES
## 1         NA         0                      14.0   100 3   0          0
## 2         NA         0                       2.0   150 2   0          0
## 3         NA         0                       0.1   123 2   0          0
## 4         NA         0                       0.0   100 2   0          0
## 5         NA         0                       0.0   150 2   0          0
## 6         NA         0                       1.5   177 2   0          0
##   INJURIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP WFO STATEOFFIC ZONENAMES
## 1       15    25.0          K       0                                    
## 2        0     2.5          K       0                                    
## 3        2    25.0          K       0                                    
## 4        2     2.5          K       0                                    
## 5        2     2.5          K       0                                    
## 6        6     2.5          K       0                                    
##   LATITUDE LONGITUDE LATITUDE_E LONGITUDE_ REMARKS REFNUM
## 1     3040      8812       3051       8806              1
## 2     3042      8755          0          0              2
## 3     3340      8742          0          0              3
## 4     3458      8626          0          0              4
## 5     3412      8642          0          0              5
## 6     3450      8748          0          0              6

Preparing Data Subset

# Subset the data
# Gather only the necessary columns
ReqCols <- c("EVTYPE", "FATALITIES", "INJURIES", "PROPDMG", "PROPDMGEXP", "CROPDMG", "CROPDMGEXP")
storm <- data[ReqCols]

# Remove original dataset to save memory
rm(data)

Preparing Data - Property Damage

The exponent will be processed and combined with property damage value.

# Explore exponent values
unique(storm$PROPDMGEXP)
##  [1] K M   B m + 0 5 6 ? 4 2 3 h 7 H - 1 8
## Levels:  - ? + 0 1 2 3 4 5 6 7 8 B h H K m M
# Sorting out property exponent data
storm$propexp[storm$PROPDMGEXP == "B"] <- 1e+09
storm$propexp[storm$PROPDMGEXP == "M"] <- 1e+06
storm$propexp[storm$PROPDMGEXP == "m"] <- 1e+06
storm$propexp[storm$PROPDMGEXP == "K"] <- 1000
storm$propexp[storm$PROPDMGEXP == "H"] <- 100
storm$propexp[storm$PROPDMGEXP == "h"] <- 100
storm$propexp[storm$PROPDMGEXP == ""] <- 1
storm$propexp[storm$PROPDMGEXP == "8"] <- 1e+08
storm$propexp[storm$PROPDMGEXP == "7"] <- 1e+07
storm$propexp[storm$PROPDMGEXP == "6"] <- 1e+06
storm$propexp[storm$PROPDMGEXP == "5"] <- 1e+05
storm$propexp[storm$PROPDMGEXP == "4"] <- 1e+04
storm$propexp[storm$PROPDMGEXP == "3"] <- 1000
storm$propexp[storm$PROPDMGEXP == "2"] <- 100
storm$propexp[storm$PROPDMGEXP == "1"] <- 10
storm$propexp[storm$PROPDMGEXP == "0"] <- 1

# Exclude invalid entries by assigning 0
storm$propexp[storm$PROPDMGEXP == "+"] <- 0
storm$propexp[storm$PROPDMGEXP == "-"] <- 0
storm$propexp[storm$PROPDMGEXP == "?"] <- 0

# Calculate property damage value
storm$PROPDMGvalue <- storm$PROPDMG * storm$propexp

Preparing Data - Crop Damage

The exponent will be processed and combined with crop damage value.

# Explore exponent values
unique(storm$CROPDMGEXP)
## [1]   M K m B ? 0 k 2
## Levels:  ? 0 2 B k K m M
# Sorting out property exponent data
storm$cropexp[storm$CROPDMGEXP == "B"] <- 1e+09
storm$cropexp[storm$CROPDMGEXP == "M"] <- 1e+06
storm$cropexp[storm$CROPDMGEXP == "m"] <- 1e+06
storm$cropexp[storm$CROPDMGEXP == "K"] <- 1000
storm$cropexp[storm$CROPDMGEXP == "k"] <- 1000
storm$cropexp[storm$CROPDMGEXP == ""] <- 1
storm$cropexp[storm$CROPDMGEXP == "2"] <- 100
storm$cropexp[storm$CROPDMGEXP == "0"] <- 1

# Exclude invalid entries by assigning 0
storm$cropexp[storm$CROPDMGEXP == "?"] <- 0

# Calculate property damage value
storm$CROPDMGvalue <- storm$CROPDMG * storm$cropexp

Aggregate Data

Count Total Impact to Human (Fatalities + Injuries) & Total Economic Impact (Property Damage + Crop Damage)

storm$HumanImpact <- storm$FATALITIES + storm$INJURIES
storm$TotalDamage <- storm$PROPDMGvalue + storm$CROPDMGvalue

To see the biggest impact of weather events, the top 10 rows will be taken for each type of impact

Generate Top 10 Lists of Highest Human Impact

# Top 10 Fatalities
TopFatalities <- storm %>% group_by(EVTYPE) %>% summarise(Fatalities=sum(FATALITIES)) %>% top_n(10) %>% arrange(desc(Fatalities))
## Selecting by Fatalities
print(TopFatalities)
## Source: local data frame [10 x 2]
## 
##            EVTYPE Fatalities
## 1         TORNADO       5633
## 2  EXCESSIVE HEAT       1903
## 3     FLASH FLOOD        978
## 4            HEAT        937
## 5       LIGHTNING        816
## 6       TSTM WIND        504
## 7           FLOOD        470
## 8     RIP CURRENT        368
## 9       HIGH WIND        248
## 10      AVALANCHE        224
# Top 10 Injuries
TopInjuries <- storm %>% group_by(EVTYPE) %>% summarise(Injuries=sum(INJURIES)) %>% top_n(10) %>% arrange(desc(Injuries))
## Selecting by Injuries
print(TopInjuries)
## Source: local data frame [10 x 2]
## 
##               EVTYPE Injuries
## 1            TORNADO    91346
## 2          TSTM WIND     6957
## 3              FLOOD     6789
## 4     EXCESSIVE HEAT     6525
## 5          LIGHTNING     5230
## 6               HEAT     2100
## 7          ICE STORM     1975
## 8        FLASH FLOOD     1777
## 9  THUNDERSTORM WIND     1488
## 10              HAIL     1361
# Top 10 Total Human Impact
TopTotalHumanImpact <- storm %>% group_by(EVTYPE) %>% summarise(HumanImpact=sum(HumanImpact)) %>% top_n(10) %>% arrange(desc(HumanImpact))
## Selecting by HumanImpact
print(TopTotalHumanImpact)
## Source: local data frame [10 x 2]
## 
##               EVTYPE HumanImpact
## 1            TORNADO       96979
## 2     EXCESSIVE HEAT        8428
## 3          TSTM WIND        7461
## 4              FLOOD        7259
## 5          LIGHTNING        6046
## 6               HEAT        3037
## 7        FLASH FLOOD        2755
## 8          ICE STORM        2064
## 9  THUNDERSTORM WIND        1621
## 10      WINTER STORM        1527

Generate Top 10 Lists of Highest Economic Impact

# Top 10 Property Damage
TopPropertyDamage <- storm %>% group_by(EVTYPE) %>% summarise(PropertyDamage=sum(PROPDMGvalue)) %>% top_n(10) %>% arrange(desc(PropertyDamage))
## Selecting by PropertyDamage
print(TopPropertyDamage)
## Source: local data frame [10 x 2]
## 
##               EVTYPE PropertyDamage
## 1              FLOOD   144657709807
## 2  HURRICANE/TYPHOON    69305840000
## 3            TORNADO    56947380617
## 4        STORM SURGE    43323536000
## 5        FLASH FLOOD    16822673979
## 6               HAIL    15735267513
## 7          HURRICANE    11868319010
## 8     TROPICAL STORM     7703890550
## 9       WINTER STORM     6688497251
## 10         HIGH WIND     5270046260
# Top 10 Crop Damage
TopCropDamage <- storm %>% group_by(EVTYPE) %>% summarise(CropDamage=sum(CROPDMGvalue)) %>% top_n(10) %>% arrange(desc(CropDamage))
## Selecting by CropDamage
print(TopCropDamage)
## Source: local data frame [10 x 2]
## 
##               EVTYPE  CropDamage
## 1            DROUGHT 13972566000
## 2              FLOOD  5661968450
## 3        RIVER FLOOD  5029459000
## 4          ICE STORM  5022113500
## 5               HAIL  3025954473
## 6          HURRICANE  2741910000
## 7  HURRICANE/TYPHOON  2607872800
## 8        FLASH FLOOD  1421317100
## 9       EXTREME COLD  1292973000
## 10      FROST/FREEZE  1094086000
# Top 10 Total Damage
TopTotalDamage <- storm %>% group_by(EVTYPE) %>% summarise(TotalDamage=sum(TotalDamage)) %>% top_n(10) %>% arrange(desc(TotalDamage))
## Selecting by TotalDamage
print(TopTotalDamage)
## Source: local data frame [10 x 2]
## 
##               EVTYPE  TotalDamage
## 1              FLOOD 150319678257
## 2  HURRICANE/TYPHOON  71913712800
## 3            TORNADO  57362333887
## 4        STORM SURGE  43323541000
## 5               HAIL  18761221986
## 6        FLASH FLOOD  18243991079
## 7            DROUGHT  15018672000
## 8          HURRICANE  14610229010
## 9        RIVER FLOOD  10148404500
## 10         ICE STORM   8967041360

Result

Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?

par(mfrow = c(1, 3), mar = c(11, 2, 3, 1), mgp = c(3, 1, 0), cex = 0.6)

barplot(TopFatalities$Fatalities, las = 3, names.arg = TopFatalities$EVTYPE, main = "Top 10 Highest Fatalities", ylab = "Number of Fatalities", col = "blue")

barplot(TopInjuries$Injuries, las = 3, names.arg = TopInjuries$EVTYPE, main = "Top 10 Highest Injuries", ylab = "Number of Injuries", col = "blue")

barplot(TopTotalHumanImpact$HumanImpact, las = 3, names.arg = TopTotalHumanImpact$EVTYPE, main = "Top 10 Highest Human Impact", ylab = "Total Fatalities & Injuries", col = "grey")

Summary

Tornado is the highest impact weather phenomenon in the United States to the human population.

Across the United States, which types of events have the greatest economic consequences?

par(mfrow = c(1, 3), mar = c(11, 2, 3, 1), mgp = c(3, 1, 0), cex = 0.6)

barplot(TopPropertyDamage$PropertyDamage, las = 3, names.arg = TopPropertyDamage$EVTYPE, main = "Top 10 Highest Property Damage", ylab = "Damage (USD)", col = "blue")

barplot(TopCropDamage$CropDamage, las = 3, names.arg = TopCropDamage$EVTYPE, main = "Top 10 Highest Crop Damage", ylab = "Damage (USD)", col = "blue")

barplot(TopTotalDamage$TotalDamage, las = 3, names.arg = TopTotalDamage$EVTYPE, main = "Top 10 Highest Total Damage", ylab = "Damage (USD)", col = "grey")

Summary

From this plot, we can know that overall, flood cause most economic damage. The biggest contributor to crop damage is drought.