Introduction

Storms and other severe weather events can cause both public health and economic problems for communities and municipalities. Many severe events can result in fatalities, injuries, and property damage, and preventing such outcomes to the extent possible is a key concern.

This project involves exploring the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. This database tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage.

Data

The data for this assignment come in the form of a comma-separated-value file compressed via the bzip2 algorithm to reduce its size. You can download the file from the course web site:

There is also some documentation of the database available. Here you will find how some of the variables are constructed/defined.

The events in the database start in the year 1950 and end in November 2011. In the earlier years of the database there are generally fewer events recorded, most likely due to a lack of good records. More recent years should be considered more complete.

Running this project requires the following libraries.

library(R.utils)
library(dplyr)
library(ggplot2)
library(gridExtra)

Data Processing

To answer the questions below, the data must first be prepared in the following steps.

  1. Download the dataset directly from the URL below, or use the following R command.
download.file("https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2", destfile = "repdata_data_StormData.csv.bz2")
  1. Once downloaded, extract the file manually into the working directory, or use the following R command (bunzip2 comes from the R.utils package). Then check that a file named repdata_data_StormData.csv has been extracted.
bunzip2("repdata_data_StormData.csv.bz2")
  1. Read the data from the CSV file and run head and names for an overview of the data.
storm<-read.csv("repdata_data_StormData.csv")
head(storm)
##   STATE__           BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME STATE
## 1       1  4/18/1950 0:00:00     0130       CST     97     MOBILE    AL
## 2       1  4/18/1950 0:00:00     0145       CST      3    BALDWIN    AL
## 3       1  2/20/1951 0:00:00     1600       CST     57    FAYETTE    AL
## 4       1   6/8/1951 0:00:00     0900       CST     89    MADISON    AL
## 5       1 11/15/1951 0:00:00     1500       CST     43    CULLMAN    AL
## 6       1 11/15/1951 0:00:00     2000       CST     77 LAUDERDALE    AL
##    EVTYPE BGN_RANGE BGN_AZI BGN_LOCATI END_DATE END_TIME COUNTY_END
## 1 TORNADO         0                                               0
## 2 TORNADO         0                                               0
## 3 TORNADO         0                                               0
## 4 TORNADO         0                                               0
## 5 TORNADO         0                                               0
## 6 TORNADO         0                                               0
##   COUNTYENDN END_RANGE END_AZI END_LOCATI LENGTH WIDTH F MAG FATALITIES
## 1         NA         0                      14.0   100 3   0          0
## 2         NA         0                       2.0   150 2   0          0
## 3         NA         0                       0.1   123 2   0          0
## 4         NA         0                       0.0   100 2   0          0
## 5         NA         0                       0.0   150 2   0          0
## 6         NA         0                       1.5   177 2   0          0
##   INJURIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP WFO STATEOFFIC ZONENAMES
## 1       15    25.0          K       0                                    
## 2        0     2.5          K       0                                    
## 3        2    25.0          K       0                                    
## 4        2     2.5          K       0                                    
## 5        2     2.5          K       0                                    
## 6        6     2.5          K       0                                    
##   LATITUDE LONGITUDE LATITUDE_E LONGITUDE_ REMARKS REFNUM
## 1     3040      8812       3051       8806              1
## 2     3042      8755          0          0              2
## 3     3340      8742          0          0              3
## 4     3458      8626          0          0              4
## 5     3412      8642          0          0              5
## 6     3450      8748          0          0              6
names(storm)
##  [1] "STATE__"    "BGN_DATE"   "BGN_TIME"   "TIME_ZONE"  "COUNTY"    
##  [6] "COUNTYNAME" "STATE"      "EVTYPE"     "BGN_RANGE"  "BGN_AZI"   
## [11] "BGN_LOCATI" "END_DATE"   "END_TIME"   "COUNTY_END" "COUNTYENDN"
## [16] "END_RANGE"  "END_AZI"    "END_LOCATI" "LENGTH"     "WIDTH"     
## [21] "F"          "MAG"        "FATALITIES" "INJURIES"   "PROPDMG"   
## [26] "PROPDMGEXP" "CROPDMG"    "CROPDMGEXP" "WFO"        "STATEOFFIC"
## [31] "ZONENAMES"  "LATITUDE"   "LONGITUDE"  "LATITUDE_E" "LONGITUDE_"
## [36] "REMARKS"    "REFNUM"
  1. Subset the data, keeping only the variables needed for the analysis: event type, fatalities, injuries, and the property and crop damage fields.
stormSelect <- subset(storm,select = c("EVTYPE", "FATALITIES", "INJURIES", "PROPDMG", "PROPDMGEXP", "CROPDMG", "CROPDMGEXP"))
head(stormSelect)
##    EVTYPE FATALITIES INJURIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP
## 1 TORNADO          0       15    25.0          K       0           
## 2 TORNADO          0        0     2.5          K       0           
## 3 TORNADO          0        2    25.0          K       0           
## 4 TORNADO          0        2     2.5          K       0           
## 5 TORNADO          0        2     2.5          K       0           
## 6 TORNADO          0        6     2.5          K       0
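
As an optional shortcut, base R can usually read a bzip2-compressed CSV directly, because read.csv opens the file through a connection that detects the compression, so the manual extraction step may not be needed. The sketch below demonstrates this on a small temporary file rather than on the storm data; the temporary file and the demo data frame are made up for illustration.

```r
# Hypothetical demo data, not taken from the storm database.
demo_in <- data.frame(EVTYPE = c("TORNADO", "FLOOD"),
                      FATALITIES = c(1, 2))

# Write the CSV through a bzip2 connection to a temporary file.
tmp <- tempfile(fileext = ".csv.bz2")
con <- bzfile(tmp, open = "w")
write.csv(demo_in, con, row.names = FALSE)
close(con)

# Read the compressed file back without decompressing it first.
demo <- read.csv(tmp)
print(demo)
```

If this works in your R installation, the same call should also work on repdata_data_StormData.csv.bz2, at the cost of decompressing the file on every read.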

1st Question (Results)

Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?

  1. Sum the fatalities and the injuries by EVTYPE and show the ten most harmful event types for each.
sortFatalities <- stormSelect %>% 
  group_by(EVTYPE) %>% 
  summarize(FATALITIES = sum(FATALITIES)) %>% 
  arrange(desc(FATALITIES))
sortFatalities[1:10,]
## Source: local data frame [10 x 2]
## 
##            EVTYPE FATALITIES
##            (fctr)      (dbl)
## 1         TORNADO       5633
## 2  EXCESSIVE HEAT       1903
## 3     FLASH FLOOD        978
## 4            HEAT        937
## 5       LIGHTNING        816
## 6       TSTM WIND        504
## 7           FLOOD        470
## 8     RIP CURRENT        368
## 9       HIGH WIND        248
## 10      AVALANCHE        224
sortInjuries <- stormSelect %>% 
  group_by(EVTYPE) %>% 
  summarize(INJURIES =sum(INJURIES)) %>% 
  arrange(desc(INJURIES))

sortInjuries[1:10,]
## Source: local data frame [10 x 2]
## 
##               EVTYPE INJURIES
##               (fctr)    (dbl)
## 1            TORNADO    91346
## 2          TSTM WIND     6957
## 3              FLOOD     6789
## 4     EXCESSIVE HEAT     6525
## 5          LIGHTNING     5230
## 6               HEAT     2100
## 7          ICE STORM     1975
## 8        FLASH FLOOD     1777
## 9  THUNDERSTORM WIND     1488
## 10              HAIL     1361
  1. Draw bar charts of the most harmful event types in the U.S.
# Order the factor levels by fatality count so the bars appear sorted in the flipped chart.
sortFatalities$EVTYPE <- factor(sortFatalities$EVTYPE, levels = sortFatalities$EVTYPE[order(sortFatalities$FATALITIES)])

ggplot(data = sortFatalities[1:10, ], aes(x = EVTYPE, y = FATALITIES)) +
  geom_bar(stat = "identity", fill = "red") +
  ylab("Fatalities") +
  xlab("Event Type") +
  coord_flip() +
  ggtitle("Fatalities vs. Event Type across the U.S (Top Ten)")

# Same ordering trick for injuries.
sortInjuries$EVTYPE <- factor(sortInjuries$EVTYPE, levels = sortInjuries$EVTYPE[order(sortInjuries$INJURIES)])

ggplot(data = sortInjuries[1:10, ], aes(x = EVTYPE, y = INJURIES)) +
  geom_bar(stat = "identity", fill = "red") +
  ylab("Injuries") +
  xlab("Event Type") +
  coord_flip() +
  ggtitle("Injuries vs. Event Type across the U.S (Top Ten)")

As the graphs show, tornadoes are by far the most harmful event type, ranking first for both fatalities (5633) and injuries (91346).
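
One caveat is visible in the tables above: TSTM WIND and THUNDERSTORM WIND are counted as separate event types, as are HEAT and EXCESSIVE HEAT. The following is a minimal sketch, on made-up toy data, of how label variants could be collapsed before summing. The mapping rule and the EVTYPE_CLEAN column are assumptions for illustration, not part of the analysis above, and base R's aggregate is used here instead of dplyr only to keep the example dependency-free.

```r
# Toy data (hypothetical values) with label variants of the same event.
toy <- data.frame(
  EVTYPE   = c("TSTM WIND", "THUNDERSTORM WIND", " tstm wind", "TORNADO"),
  INJURIES = c(3, 2, 1, 10)
)

# Normalise case and whitespace, then map one known abbreviation.
clean <- toupper(trimws(toy$EVTYPE))
clean <- sub("^TSTM WIND$", "THUNDERSTORM WIND", clean)
toy$EVTYPE_CLEAN <- clean

# Sum injuries per cleaned label and sort in descending order.
totals <- aggregate(INJURIES ~ EVTYPE_CLEAN, data = toy, FUN = sum)
totals <- totals[order(-totals$INJURIES), ]
print(totals)
```

Whether two labels really describe the same kind of event is a judgment call that should be checked against the database documentation before merging them.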

2nd Question (Results)

Across the United States, which types of events have the greatest economic consequences?

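  1. Convert the property and crop damage exponent codes (PROPDMGEXP and CROPDMGEXP) to numeric multipliers, and add the resulting dollar values together as TotalFinal.
# Property damage multiplier: K = thousand, M = million, B = billion.
# Any other value (including blank) keeps the default of 0, so those rows contribute no damage.
stormSelect$pexp <- 0
stormSelect$pexp[stormSelect$PROPDMGEXP == 'K'] <- 1000
stormSelect$pexp[stormSelect$PROPDMGEXP == 'M'] <- 1000000
stormSelect$pexp[stormSelect$PROPDMGEXP == 'B'] <- 1000000000
# Crop damage multiplier, built the same way.
stormSelect$cexp <- 0
stormSelect$cexp[stormSelect$CROPDMGEXP == 'K'] <- 1000
stormSelect$cexp[stormSelect$CROPDMGEXP == 'M'] <- 1000000
stormSelect$cexp[stormSelect$CROPDMGEXP == 'B'] <- 1000000000

# Damage in dollars, per event.
stormSelect$propFinal <- stormSelect$PROPDMG * stormSelect$pexp
stormSelect$cropFinal <- stormSelect$CROPDMG * stormSelect$cexp
stormSelect$TotalFinal <- stormSelect$propFinal + stormSelect$cropFinal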
  1. Sum TotalFinal by EVTYPE and show the ten event types with the greatest economic impact (in millions of dollars).
sortTotalFinal <- stormSelect %>% 
  group_by(EVTYPE) %>% 
  summarize(TotalFinal = sum(TotalFinal)/1e6) %>% 
  arrange(desc(TotalFinal))
sortTotalFinal[1:10,]
## Source: local data frame [10 x 2]
## 
##               EVTYPE TotalFinal
##               (fctr)      (dbl)
## 1              FLOOD 150319.678
## 2  HURRICANE/TYPHOON  71913.713
## 3            TORNADO  57340.614
## 4        STORM SURGE  43323.541
## 5               HAIL  18752.904
## 6        FLASH FLOOD  17562.129
## 7            DROUGHT  15018.672
## 8          HURRICANE  14610.229
## 9        RIVER FLOOD  10148.405
## 10         ICE STORM   8967.041
  1. Draw a bar chart of the event types with the greatest economic impact.
# Order the factor levels by total damage so the bars appear sorted in the flipped chart.
sortTotalFinal$EVTYPE <- factor(sortTotalFinal$EVTYPE, levels = sortTotalFinal$EVTYPE[order(sortTotalFinal$TotalFinal)])

ggplot(data = sortTotalFinal[1:10, ], aes(x = EVTYPE, y = TotalFinal)) +
  geom_bar(stat = "identity", fill = "red") +
  ylab("Total Damage ($ in Million)") +
  xlab("Event Type") +
  coord_flip() +
  ggtitle("Economic Impact (Total Damage) across the U.S (Top Ten)")

In conclusion, floods have the greatest economic impact in the U.S., with roughly $150 billion in combined property and crop damage.
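
The exponent conversion in the first step of this section can also be written more compactly with a named lookup vector. The sketch below is one possible way to do it; the names exp_lookup and to_multiplier are hypothetical helpers introduced here. It is meant to produce the same multipliers as the step-by-step assignments: only the uppercase codes K, M, and B are recognised, and everything else maps to 0.

```r
# Named lookup vector: exponent code -> multiplier.
exp_lookup <- c(K = 1e3, M = 1e6, B = 1e9)

# Hypothetical helper: translate exponent codes into numeric multipliers.
to_multiplier <- function(code) {
  m <- exp_lookup[as.character(code)]  # unmatched codes yield NA
  m[is.na(m)] <- 0                     # treat unknown or blank codes as 0
  unname(m)
}

# Example on a few codes, including a blank and a lowercase one.
mult <- to_multiplier(c("K", "M", "B", "", "k"))
print(mult)
```

With this helper, the property damage in dollars could be computed as stormSelect$PROPDMG * to_multiplier(stormSelect$PROPDMGEXP), and the crop damage likewise.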