In this study we explore the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. This database tracks characteristics of major storms and weather events in the United States (between 1950 and 2011), including when and where they occur, as well as estimates of any fatalities, injuries, property and crop damage.
We will try to figure out which types of events, across the United States, are most harmful : 1/ With respect to population health (question 1) 2/ With respect to economic consequences (question 2)
We got our data from : https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2 Time of download : 30th of october 2016, at 16h GMT
We set our session in the same folder as our data in Rstudio Now let’s load and have a look at our data
data <- read.csv("repdata%2Fdata%2FStormData.csv.bz2", header=TRUE)
## Let's have a look at our data
head(data)## This shows the first values, just to understand how data is structured
## STATE__ BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME STATE
## 1 1 4/18/1950 0:00:00 0130 CST 97 MOBILE AL
## 2 1 4/18/1950 0:00:00 0145 CST 3 BALDWIN AL
## 3 1 2/20/1951 0:00:00 1600 CST 57 FAYETTE AL
## 4 1 6/8/1951 0:00:00 0900 CST 89 MADISON AL
## 5 1 11/15/1951 0:00:00 1500 CST 43 CULLMAN AL
## 6 1 11/15/1951 0:00:00 2000 CST 77 LAUDERDALE AL
## EVTYPE BGN_RANGE BGN_AZI BGN_LOCATI END_DATE END_TIME COUNTY_END
## 1 TORNADO 0 0
## 2 TORNADO 0 0
## 3 TORNADO 0 0
## 4 TORNADO 0 0
## 5 TORNADO 0 0
## 6 TORNADO 0 0
## COUNTYENDN END_RANGE END_AZI END_LOCATI LENGTH WIDTH F MAG FATALITIES
## 1 NA 0 14.0 100 3 0 0
## 2 NA 0 2.0 150 2 0 0
## 3 NA 0 0.1 123 2 0 0
## 4 NA 0 0.0 100 2 0 0
## 5 NA 0 0.0 150 2 0 0
## 6 NA 0 1.5 177 2 0 0
## INJURIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP WFO STATEOFFIC ZONENAMES
## 1 15 25.0 K 0
## 2 0 2.5 K 0
## 3 2 25.0 K 0
## 4 2 2.5 K 0
## 5 2 2.5 K 0
## 6 6 2.5 K 0
## LATITUDE LONGITUDE LATITUDE_E LONGITUDE_ REMARKS REFNUM
## 1 3040 8812 3051 8806 1
## 2 3042 8755 0 0 2
## 3 3340 8742 0 0 3
## 4 3458 8626 0 0 4
## 5 3412 8642 0 0 5
## 6 3450 8748 0 0 6
dim(data)
## [1] 902297 37
We need to process our data a little bit, because it is not easy to read “economic damage” information. Indeed we have 2 columns of values (PROPDMG for property and CROPDMG for crop damage) and 2 columns of magnitude (PROPDMGEXP and CROPDMGEXP with possible values : K or M or B for resp. Thousands, Millions, Billions) We need to create a new variable “TOTAL_cost” that summerizes the total economic cost.
## We need to loop and calculate each agregated value
loop_size <- dim(data)[1]
TOTAL_cost <- NULL
for (i in 1:loop_size) {
cost1 <- data$PROPDMG[i]
Mag1 <- data$PROPDMGEXP[i]
if (Mag1=="K") {cost1 <- cost1*1000} ## We want all values to be in $
if (Mag1=="M") {cost1 <- cost1*1000000} ## We want all values to be in $
if (Mag1=="B") {cost1 <- cost1*1000000000} ## We want all values to be in $
cost2 <- data$CROPDMG[i]
Mag2 <- data$CROPDMGEXP[i]
if (Mag2=="K") {cost2 <- cost2*1000} ## We want all values to be in $
if (Mag2=="M") {cost2 <- cost2*1000000} ## We want all values to be in $
if (Mag2=="B") {cost2 <- cost2*1000000000} ## We want all values to be in $
data$TOTAL_cost[i] <- (cost1 + cost2)/1000000000 ## We want the total value to be in Billion $
## The following 4 lines of code can be activated to print % progress because this calculation is time consuming !
##if(i==1){print("staring for loop!")} ## to see when the loop starts
##ref <- 1:100
##ref <- ref * 9000
##if(i %in% ref){print(i/9000)} ## This is a percentage progress
## End of for loop
}
We will answer in 2 parts : in terms of fatalities and in terms of injuries
1/ Fatalities
Nb_Fatalities_per_EVTYPE <- with(data, tapply(FATALITIES,EVTYPE,sum,na.rm=TRUE))
## We have 985 values !
dim(Nb_Fatalities_per_EVTYPE)
## [1] 985
## The question is about "the most harmful events", so let's find the TOP 10 (ie 1%)
quantile(Nb_Fatalities_per_EVTYPE, .99)
## 99%
## 208.88
## Let's check that our 1% top values are TOP 10
sum(Nb_Fatalities_per_EVTYPE>208) ## we validate we get 10 values
## [1] 10
## Let's extract our TOP 10
TOP_10 <- Nb_Fatalities_per_EVTYPE[Nb_Fatalities_per_EVTYPE>208]
## Let's demonstrate that our TOP 10 has major impact => it represents 80% of the total fatalities!(pareto rule)
sum(TOP_10)/sum(data$FATALITIES) ## ~1% top values <=> 80% of total nb of fatalities
## [1] 0.797689
## Let's sort and answer the question :
TOP_10_values1.1 <- TOP_10[order(TOP_10,decreasing=TRUE)]
TOP_10_values1.1
## TORNADO EXCESSIVE HEAT FLASH FLOOD HEAT LIGHTNING
## 5633 1903 978 937 816
## TSTM WIND FLOOD RIP CURRENT HIGH WIND AVALANCHE
## 504 470 368 248 224
2/ Injuries
Nb_Injuries_per_EVTYPE <- with(data, tapply(INJURIES,EVTYPE,sum,na.rm=TRUE))
## We have 985 values !
dim(Nb_Injuries_per_EVTYPE)
## [1] 985
## The question is about "the most harmful events", so let's find the TOP 10 (ie 1%)
quantile(Nb_Injuries_per_EVTYPE, .99)
## 99%
## 1327.4
## Let's check that our 1% top values are TOP 10
sum(Nb_Injuries_per_EVTYPE>1327) ## we validate we get 10 values
## [1] 10
## Let's extract our TOP 10
TOP_10 <- Nb_Injuries_per_EVTYPE[Nb_Injuries_per_EVTYPE>1327]
## Let's demonstrate that our TOP 10 has major impact => it represents 90% of the total injuries!(pareto rule, even more than fatalities)
sum(TOP_10)/sum(data$INJURIES) ## ~1% top values <=> 90% of total nb of injuries
## [1] 0.893402
## Let's sort and answer the question :
TOP_10_values1.2 <- TOP_10[order(TOP_10,decreasing=TRUE)]
TOP_10_values1.2
## TORNADO TSTM WIND FLOOD EXCESSIVE HEAT
## 91346 6957 6789 6525
## LIGHTNING HEAT ICE STORM FLASH FLOOD
## 5230 2100 1975 1777
## THUNDERSTORM WIND HAIL
## 1488 1361
Let’s model economic consequences as a sum of 2 amounts : property damage (in $) + crop damage (in $) Hopefully, we have already created in “processing data” part a new variable “TOTAL_cost” that summerizes economic cost
## We can proceed as in question 1, with 1 variable = TOTAL_cost
Amount_per_EVTYPE <- with(data, tapply(TOTAL_cost,EVTYPE,sum,na.rm=TRUE))
## We have 985 values !
dim(Amount_per_EVTYPE)
## [1] 985
## The question is about "the most harmful events", so let's find the TOP 10 (ie 1%)
quantile(Amount_per_EVTYPE, .99)
## 99%
## 8.475805
## Let's check that our 1% top values are TOP 10
sum(Amount_per_EVTYPE>8.4) ## we validate we get 10 values
## [1] 10
## Let's extract our TOP 10
TOP_10 <- Amount_per_EVTYPE[Amount_per_EVTYPE>8.4]
## Let's demonstrate that our TOP 10 has major impact => it represents 85% of the total cost!(pareto rule)
sum(TOP_10)/sum(data$TOTAL_cost) ## ~1% top values <=> 85% of total cost
## [1] 0.8563804
## Let's sort and answer the question :
TOP_10_values2 <- TOP_10[order(TOP_10,decreasing=TRUE)]
TOP_10_values2
## FLOOD HURRICANE/TYPHOON TORNADO STORM SURGE
## 150.319678 71.913713 57.340614 43.323541
## HAIL FLASH FLOOD DROUGHT HURRICANE
## 18.752905 17.562129 15.018672 14.610229
## RIVER FLOOD ICE STORM
## 10.148404 8.967041
Summary answer to question 1: Be it in terms of fatalities or in terms of injuries, we have a TOP 10 events explaining more than 80% of damages (pareto rule). But 1 event appears to be more harmful than all the others (especially in terms of injuries) : TORNADO
## Let's plot the TOP 10 most harmful events in terms of fatalities
barplot(TOP_10_values1.1,xlab="", ylab="Nb of fatalities",cex.names=1, las=2, main="Nb of fatalities for the TOP 10 most harmful events (1950-2011)")
## This TOP 10 is significative because as we said, it represents 80% of the total fatalities
## Let's plot the TOP 10 most harmful events in terms of injuries
barplot(TOP_10_values1.2,xlab="", ylab="Nb of injuries",cex.names=1, las=2, main="Nb of injuries for the TOP 10 most harmful events (1950-2011)")
## This TOP 10 is significative because as we said, it represents 90% of the total injuries
Summary answer to question 2: Again, the TOP 10 explains more than 80% of the total damages, this time in terms of economic cost The result is very interesting because different from question 1 : FLOOD related events appear this time, in terms of economic damages, to be far more harmful than TORNADO
## Let's plot the TOP 10 most harmful events in terms of economic damage (property and crop)
barplot(TOP_10_values2,xlab="", ylab="Total cost of damages (in Billion $)",cex.names=1, las=2, main="TOTAL damage cost of the TOP 10 most harmful events (1950-2011)")
## This TOP 10 is significative because as we said, it represents 85% of the total cost