Synopsis

In this study we explore the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. This database tracks characteristics of major storms and weather events in the United States (between 1950 and 2011), including when and where they occur, as well as estimates of any fatalities, injuries, property and crop damage.

We will try to figure out which types of events, across the United States, are most harmful : 1/ With respect to population health (question 1) 2/ With respect to economic consequences (question 2)

Data Processing

We got our data from : https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2 Time of download : 30th of october 2016, at 16h GMT

We set our session in the same folder as our data in Rstudio Now let’s load and have a look at our data

data <- read.csv("repdata%2Fdata%2FStormData.csv.bz2", header=TRUE)
## Let's have a look at our data
head(data)## This shows the first values, just to understand how data is structured
##   STATE__           BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME STATE
## 1       1  4/18/1950 0:00:00     0130       CST     97     MOBILE    AL
## 2       1  4/18/1950 0:00:00     0145       CST      3    BALDWIN    AL
## 3       1  2/20/1951 0:00:00     1600       CST     57    FAYETTE    AL
## 4       1   6/8/1951 0:00:00     0900       CST     89    MADISON    AL
## 5       1 11/15/1951 0:00:00     1500       CST     43    CULLMAN    AL
## 6       1 11/15/1951 0:00:00     2000       CST     77 LAUDERDALE    AL
##    EVTYPE BGN_RANGE BGN_AZI BGN_LOCATI END_DATE END_TIME COUNTY_END
## 1 TORNADO         0                                               0
## 2 TORNADO         0                                               0
## 3 TORNADO         0                                               0
## 4 TORNADO         0                                               0
## 5 TORNADO         0                                               0
## 6 TORNADO         0                                               0
##   COUNTYENDN END_RANGE END_AZI END_LOCATI LENGTH WIDTH F MAG FATALITIES
## 1         NA         0                      14.0   100 3   0          0
## 2         NA         0                       2.0   150 2   0          0
## 3         NA         0                       0.1   123 2   0          0
## 4         NA         0                       0.0   100 2   0          0
## 5         NA         0                       0.0   150 2   0          0
## 6         NA         0                       1.5   177 2   0          0
##   INJURIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP WFO STATEOFFIC ZONENAMES
## 1       15    25.0          K       0                                    
## 2        0     2.5          K       0                                    
## 3        2    25.0          K       0                                    
## 4        2     2.5          K       0                                    
## 5        2     2.5          K       0                                    
## 6        6     2.5          K       0                                    
##   LATITUDE LONGITUDE LATITUDE_E LONGITUDE_ REMARKS REFNUM
## 1     3040      8812       3051       8806              1
## 2     3042      8755          0          0              2
## 3     3340      8742          0          0              3
## 4     3458      8626          0          0              4
## 5     3412      8642          0          0              5
## 6     3450      8748          0          0              6
dim(data)
## [1] 902297     37

We need to process our data a little bit, because it is not easy to read “economic damage” information. Indeed we have 2 columns of values (PROPDMG for property and CROPDMG for crop damage) and 2 columns of magnitude (PROPDMGEXP and CROPDMGEXP with possible values : K or M or B for resp. Thousands, Millions, Billions) We need to create a new variable “TOTAL_cost” that summerizes the total economic cost.

## We need to loop and calculate each agregated value
loop_size <- dim(data)[1]
TOTAL_cost <- NULL
for (i in 1:loop_size) {
  cost1 <- data$PROPDMG[i]
  Mag1 <- data$PROPDMGEXP[i]
  if (Mag1=="K") {cost1 <- cost1*1000} ## We want all values to be in $
  if (Mag1=="M") {cost1 <- cost1*1000000} ## We want all values to be in $
  if (Mag1=="B") {cost1 <- cost1*1000000000} ## We want all values to be in $
  
  cost2 <- data$CROPDMG[i]
  Mag2 <- data$CROPDMGEXP[i]
  if (Mag2=="K") {cost2 <- cost2*1000} ## We want all values to be in $
  if (Mag2=="M") {cost2 <- cost2*1000000} ## We want all values to be in $
  if (Mag2=="B") {cost2 <- cost2*1000000000} ## We want all values to be in $
  
  data$TOTAL_cost[i] <- (cost1 + cost2)/1000000000 ## We want the total value to be in Billion $

  ## The following 4 lines of code can be activated to print % progress because this calculation is time consuming !
  ##if(i==1){print("staring for loop!")} ## to see when the loop starts
  ##ref <- 1:100
  ##ref <- ref * 9000
  ##if(i %in% ref){print(i/9000)} ## This is a percentage progress

  ## End of for loop
}

Question 1 : Across the United States, which types of events are most harmful with respect to population health?

We will answer in 2 parts : in terms of fatalities and in terms of injuries

1/ Fatalities

Nb_Fatalities_per_EVTYPE <- with(data, tapply(FATALITIES,EVTYPE,sum,na.rm=TRUE))
## We have 985 values ! 
dim(Nb_Fatalities_per_EVTYPE)
## [1] 985
## The question is about "the most harmful events", so let's find the TOP 10 (ie 1%)
quantile(Nb_Fatalities_per_EVTYPE, .99)
##    99% 
## 208.88
## Let's check that our 1% top values are TOP 10
sum(Nb_Fatalities_per_EVTYPE>208) ## we validate we get 10 values
## [1] 10
## Let's extract our TOP 10
TOP_10 <- Nb_Fatalities_per_EVTYPE[Nb_Fatalities_per_EVTYPE>208]
## Let's demonstrate that our TOP 10 has major impact => it represents 80% of the total fatalities!(pareto rule)
sum(TOP_10)/sum(data$FATALITIES) ## ~1% top values <=> 80% of total nb of fatalities
## [1] 0.797689
## Let's sort and answer the question :
TOP_10_values1.1 <- TOP_10[order(TOP_10,decreasing=TRUE)]
TOP_10_values1.1
##        TORNADO EXCESSIVE HEAT    FLASH FLOOD           HEAT      LIGHTNING 
##           5633           1903            978            937            816 
##      TSTM WIND          FLOOD    RIP CURRENT      HIGH WIND      AVALANCHE 
##            504            470            368            248            224

2/ Injuries

Nb_Injuries_per_EVTYPE <- with(data, tapply(INJURIES,EVTYPE,sum,na.rm=TRUE))
## We have 985 values ! 
dim(Nb_Injuries_per_EVTYPE)
## [1] 985
## The question is about "the most harmful events", so let's find the TOP 10 (ie 1%)
quantile(Nb_Injuries_per_EVTYPE, .99)
##    99% 
## 1327.4
## Let's check that our 1% top values are TOP 10
sum(Nb_Injuries_per_EVTYPE>1327) ## we validate we get 10 values
## [1] 10
## Let's extract our TOP 10
TOP_10 <- Nb_Injuries_per_EVTYPE[Nb_Injuries_per_EVTYPE>1327]
## Let's demonstrate that our TOP 10 has major impact => it represents 90% of the total injuries!(pareto rule, even more than fatalities)
sum(TOP_10)/sum(data$INJURIES) ## ~1% top values <=> 90% of total nb of injuries
## [1] 0.893402
## Let's sort and answer the question :
TOP_10_values1.2 <- TOP_10[order(TOP_10,decreasing=TRUE)]
TOP_10_values1.2
##           TORNADO         TSTM WIND             FLOOD    EXCESSIVE HEAT 
##             91346              6957              6789              6525 
##         LIGHTNING              HEAT         ICE STORM       FLASH FLOOD 
##              5230              2100              1975              1777 
## THUNDERSTORM WIND              HAIL 
##              1488              1361

Question 2 : Across the United States, which types of events have the greatest economic consequences?

Let’s model economic consequences as a sum of 2 amounts : property damage (in $) + crop damage (in $) Hopefully, we have already created in “processing data” part a new variable “TOTAL_cost” that summerizes economic cost

## We can proceed as in question 1, with 1 variable = TOTAL_cost
Amount_per_EVTYPE <- with(data, tapply(TOTAL_cost,EVTYPE,sum,na.rm=TRUE))
## We have 985 values ! 
dim(Amount_per_EVTYPE)
## [1] 985
## The question is about "the most harmful events", so let's find the TOP 10 (ie 1%)
quantile(Amount_per_EVTYPE, .99)
##      99% 
## 8.475805
## Let's check that our 1% top values are TOP 10
sum(Amount_per_EVTYPE>8.4) ## we validate we get 10 values
## [1] 10
## Let's extract our TOP 10
TOP_10 <- Amount_per_EVTYPE[Amount_per_EVTYPE>8.4]
## Let's demonstrate that our TOP 10 has major impact => it represents 85% of the total cost!(pareto rule)
sum(TOP_10)/sum(data$TOTAL_cost) ## ~1% top values <=> 85% of total cost
## [1] 0.8563804
## Let's sort and answer the question :
TOP_10_values2 <- TOP_10[order(TOP_10,decreasing=TRUE)]
TOP_10_values2
##             FLOOD HURRICANE/TYPHOON           TORNADO       STORM SURGE 
##        150.319678         71.913713         57.340614         43.323541 
##              HAIL       FLASH FLOOD           DROUGHT         HURRICANE 
##         18.752905         17.562129         15.018672         14.610229 
##       RIVER FLOOD         ICE STORM 
##         10.148404          8.967041

Results

Summary answer to question 1: Be it in terms of fatalities or in terms of injuries, we have a TOP 10 events explaining more than 80% of damages (pareto rule). But 1 event appears to be more harmful than all the others (especially in terms of injuries) : TORNADO

## Let's plot the TOP 10 most harmful events in terms of fatalities
barplot(TOP_10_values1.1,xlab="", ylab="Nb of fatalities",cex.names=1, las=2, main="Nb of fatalities for the TOP 10 most harmful events (1950-2011)")

##  This TOP 10 is significative because as we said, it represents 80% of the total fatalities
## Let's plot the TOP 10 most harmful events in terms of injuries
barplot(TOP_10_values1.2,xlab="", ylab="Nb of injuries",cex.names=1, las=2, main="Nb of injuries for the TOP 10 most harmful events (1950-2011)")

##  This TOP 10 is significative because as we said, it represents 90% of the total injuries

Summary answer to question 2: Again, the TOP 10 explains more than 80% of the total damages, this time in terms of economic cost The result is very interesting because different from question 1 : FLOOD related events appear this time, in terms of economic damages, to be far more harmful than TORNADO

## Let's plot the TOP 10 most harmful events in terms of economic damage (property and crop)
barplot(TOP_10_values2,xlab="", ylab="Total cost of damages (in Billion $)",cex.names=1, las=2, main="TOTAL damage cost of the TOP 10 most harmful events (1950-2011)")

##  This TOP 10 is significative because as we said, it represents 85% of the total cost