This document will show that RIP CURRENT and EXCEESIVE HEAT are the events that most people killed in average by each time they happen. Also, the events that have killed more people between 2002 and November 2011 are TORNADO and again EXCEESIVE HEAT.
On the other hand, the event that more economic damage produces each time it occurs is HURRICANE/TYPHOON with $817 million of damage, followed by STORM SURGE and HURRICANE. However, summarizing the data from 2002 up to November 2011, the event FLOOD leads the damage with a total economic damage of $13.7 billions.
The data for this analysis was downloaded from the NOAA web site (https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2). Then it was extracted and preprocessed using R Studio. In the next section we will see how these tasks are performed in detail:
library(dplyr)
##
## Attaching package: 'dplyr'
##
## The following object is masked from 'package:stats':
##
## filter
##
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
library(xtable)
## Warning: package 'xtable' was built under R version 3.1.3
library(reshape)
## Warning: package 'reshape' was built under R version 3.1.3
##
## Attaching package: 'reshape'
##
## The following object is masked from 'package:dplyr':
##
## rename
library(gridExtra)
## Warning: package 'gridExtra' was built under R version 3.1.3
Sys.setlocale(category = "LC_ALL", locale = "C")
## [1] "C"
options(scipen = 20)
Here we have downloaded the file from NOAA and copied to the project folder. Then we load the file using R:
# download.file("https://d396qusza40orc.cloudfront.net/repdata-data-StormData.csv.bz2", "repdata-data-StormData.csv.bz2")
data= read.csv(bzfile("repdata-data-StormData.csv.bz2"), stringsAsFactors=FALSE)
Let’s look at the data loaded:
str(data)
## 'data.frame': 902297 obs. of 37 variables:
## $ STATE__ : num 1 1 1 1 1 1 1 1 1 1 ...
## $ BGN_DATE : chr "4/18/1950 0:00:00" "4/18/1950 0:00:00" "2/20/1951 0:00:00" "6/8/1951 0:00:00" ...
## $ BGN_TIME : chr "0130" "0145" "1600" "0900" ...
## $ TIME_ZONE : chr "CST" "CST" "CST" "CST" ...
## $ COUNTY : num 97 3 57 89 43 77 9 123 125 57 ...
## $ COUNTYNAME: chr "MOBILE" "BALDWIN" "FAYETTE" "MADISON" ...
## $ STATE : chr "AL" "AL" "AL" "AL" ...
## $ EVTYPE : chr "TORNADO" "TORNADO" "TORNADO" "TORNADO" ...
## $ BGN_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ BGN_AZI : chr "" "" "" "" ...
## $ BGN_LOCATI: chr "" "" "" "" ...
## $ END_DATE : chr "" "" "" "" ...
## $ END_TIME : chr "" "" "" "" ...
## $ COUNTY_END: num 0 0 0 0 0 0 0 0 0 0 ...
## $ COUNTYENDN: logi NA NA NA NA NA NA ...
## $ END_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ END_AZI : chr "" "" "" "" ...
## $ END_LOCATI: chr "" "" "" "" ...
## $ LENGTH : num 14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
## $ WIDTH : num 100 150 123 100 150 177 33 33 100 100 ...
## $ F : int 3 2 2 2 2 2 2 1 3 3 ...
## $ MAG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ FATALITIES: num 0 0 0 0 0 0 0 0 1 0 ...
## $ INJURIES : num 15 0 2 2 2 6 1 0 14 0 ...
## $ PROPDMG : num 25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
## $ PROPDMGEXP: chr "K" "K" "K" "K" ...
## $ CROPDMG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CROPDMGEXP: chr "" "" "" "" ...
## $ WFO : chr "" "" "" "" ...
## $ STATEOFFIC: chr "" "" "" "" ...
## $ ZONENAMES : chr "" "" "" "" ...
## $ LATITUDE : num 3040 3042 3340 3458 3412 ...
## $ LONGITUDE : num 8812 8755 8742 8626 8642 ...
## $ LATITUDE_E: num 3051 0 0 0 0 ...
## $ LONGITUDE_: num 8806 0 0 0 0 ...
## $ REMARKS : chr "" "" "" "" ...
## $ REFNUM : num 1 2 3 4 5 6 7 8 9 10 ...
In the following section, we will convert the date column to date format. Also we will filter our data set to include only the records from 2002. In this way, we will make sure that we work with relatively recent data as well as avoid some normalization of old data. Also, we will work only with events that have 30 or more observations in order to have a better statistical relevance of our analysis.
#converting dates:
data$BGN_DATE=as.Date(data$BGN_DATE, "%m/%d/%Y")
#Using only roughly 10 years for the analysis
data=dplyr::filter(data, BGN_DATE>='2002-01-01')
#also we will remove the events where we have less than 30 observations:
enoughtDataEvents=dplyr::group_by(data, EVTYPE) %>% dplyr::summarise(n=n()) %>% filter(n>=30)
data=dplyr::inner_join(data, enoughtDataEvents, by="EVTYPE")
Let’s see now how is the data related to economic damage:
table(data$PROPDMGEXP)
##
## 0 B K M
## 141242 1 27 307171 4911
summary(data$PROPDMG)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 0.00 0.00 11.42 1.00 5000.00
table(data$CROPDMGEXP)
##
## B K M
## 185500 4 266829 1019
summary(data$CROPDMG)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.000 0.000 1.689 0.000 990.000
Looking at the economic damage columns (PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP), we see that we can intuitively understand the values (K=thousands, M=millions, B=billions). However, we can see an odd value of “0” in the PROPDMGEXP column, for 1 observation. We will remove this observation. Also, we normalize the event “RIP CURRENT”, which is also written as “RIP CURRENTS” in the database.
data=dplyr::filter(data, PROPDMGEXP!="0")
data$EVTYPE=ifelse(grepl("RIP CURRENT", data$EVTYPE), "RIP CURRENT", data$EVTYPE)
We can assume that the total economic damage can be expressed by the sum of the different types of damages: PROPDMG and CROPDMG. For this analysis we will transform those values according to the columns PROPDMGEXP and CROPDMGEXP. The values in those columns (“K”, “M”, and “B”) indicate that the damage is in thousands, millions or billions of dollars, respectively. Also, we will scale the final values to a magnitude of thousands of dollars.
data$PROPDMG2=ifelse(data$PROPDMGEXP=="K", data$PROPDMG,
ifelse(data$PROPDMGEXP=="M", data$PROPDMG*1000,
ifelse(data$PROPDMGEXP=="B", data$PROPDMG*1000000,
data$PROPDMG/1000 #normalize to thousands of dollars
)))
data$CROPDMG2=ifelse(data$CROPDMGEXP=="K", data$CROPDMG,
ifelse(data$CROPDMGEXP=="M", data$CROPDMG*1000,
ifelse(data$CROPDMGEXP=="B", data$CROPDMG*1000000,
data$CROPDMG/1000 #normalize to thousands of dollars
)))
Lets look at the events with most fatalities per each occurrence:
tmp=dplyr::group_by(data, EVTYPE) %>%
dplyr::summarize(FATALITIES=sum(FATALITIES, na.rm=T),
INJURIES=sum(INJURIES, na.rm=T),
Events=n()) %>%
dplyr::mutate(FatalitiesPerEvent=(FATALITIES/Events))
tmp=as.data.frame(tmp)
tmp=dplyr::arrange(tmp, -FatalitiesPerEvent)
print(xtable(head(tmp,10)),type = "html", comment=F)
| EVTYPE | FATALITIES | INJURIES | Events | FatalitiesPerEvent | |
|---|---|---|---|---|---|
| 1 | RIP CURRENT | 384.00 | 290.00 | 492 | 0.78 |
| 2 | EXCESSIVE HEAT | 691.00 | 2797.00 | 901 | 0.77 |
| 3 | HURRICANE/TYPHOON | 64.00 | 1275.00 | 88 | 0.73 |
| 4 | AVALANCHE | 145.00 | 103.00 | 274 | 0.53 |
| 5 | HEAT | 229.00 | 1222.00 | 709 | 0.32 |
| 6 | MARINE STRONG WIND | 14.00 | 22.00 | 48 | 0.29 |
| 7 | FOG | 22.00 | 146.00 | 84 | 0.26 |
| 8 | EXTREME COLD | 13.00 | 0.00 | 69 | 0.19 |
| 9 | HEAVY SURF/HIGH SURF | 42.00 | 48.00 | 228 | 0.18 |
| 10 | COLD/WIND CHILL | 95.00 | 12.00 | 539 | 0.18 |
Now, lets see the events that contribute more to fatalities, regardless the number of events:
tmp=dplyr::arrange(tmp, -FATALITIES)
print(xtable(head(tmp,10)),type = "html", comment=F)
| EVTYPE | FATALITIES | INJURIES | Events | FatalitiesPerEvent | |
|---|---|---|---|---|---|
| 1 | TORNADO | 1112.00 | 13588.00 | 15167 | 0.07 |
| 2 | EXCESSIVE HEAT | 691.00 | 2797.00 | 901 | 0.77 |
| 3 | FLASH FLOOD | 539.00 | 517.00 | 36031 | 0.01 |
| 4 | RIP CURRENT | 384.00 | 290.00 | 492 | 0.78 |
| 5 | LIGHTNING | 370.00 | 2250.00 | 7899 | 0.05 |
| 6 | FLOOD | 247.00 | 301.00 | 18071 | 0.01 |
| 7 | HEAT | 229.00 | 1222.00 | 709 | 0.32 |
| 8 | AVALANCHE | 145.00 | 103.00 | 274 | 0.53 |
| 9 | THUNDERSTORM WIND | 130.00 | 1400.00 | 81402 | 0.00 |
| 10 | EXTREME COLD/WIND CHILL | 125.00 | 24.00 | 1002 | 0.12 |
We can see that tornado events have taken the biggest number of lives between 2002 and the end of 2011. However, if we look at the deaths per each occurrence of the events, the more dangerous in terms of fatalities are rip currents.
Now, lets see the events with most injuries per each occurrence:
tmp=dplyr::mutate(tmp, InjuriesPerEvent=(INJURIES/Events)) %>%
dplyr::arrange(-InjuriesPerEvent) %>%
dplyr::select(-FatalitiesPerEvent)
print(xtable(head(tmp,10)),type = "html", comment=F)
| EVTYPE | FATALITIES | INJURIES | Events | InjuriesPerEvent | |
|---|---|---|---|---|---|
| 1 | HURRICANE/TYPHOON | 64.00 | 1275.00 | 88 | 14.49 |
| 2 | EXCESSIVE HEAT | 691.00 | 2797.00 | 901 | 3.10 |
| 3 | FOG | 22.00 | 146.00 | 84 | 1.74 |
| 4 | HEAT | 229.00 | 1222.00 | 709 | 1.72 |
| 5 | TORNADO | 1112.00 | 13588.00 | 15167 | 0.90 |
| 6 | DUST STORM | 9.00 | 209.00 | 342 | 0.61 |
| 7 | RIP CURRENT | 384.00 | 290.00 | 492 | 0.59 |
| 8 | MARINE STRONG WIND | 14.00 | 22.00 | 48 | 0.46 |
| 9 | TROPICAL STORM | 27.00 | 261.00 | 584 | 0.45 |
| 10 | WILD/FOREST FIRE | 1.00 | 140.00 | 345 | 0.41 |
And also, the injuries regardless the number of occurrence of the events
tmp=dplyr::arrange(tmp, -INJURIES)
print(xtable(head(tmp,10)),type = "html", comment=F)
| EVTYPE | FATALITIES | INJURIES | Events | InjuriesPerEvent | |
|---|---|---|---|---|---|
| 1 | TORNADO | 1112.00 | 13588.00 | 15167 | 0.90 |
| 2 | EXCESSIVE HEAT | 691.00 | 2797.00 | 901 | 3.10 |
| 3 | LIGHTNING | 370.00 | 2250.00 | 7899 | 0.28 |
| 4 | THUNDERSTORM WIND | 130.00 | 1400.00 | 81402 | 0.02 |
| 5 | HURRICANE/TYPHOON | 64.00 | 1275.00 | 88 | 14.49 |
| 6 | HEAT | 229.00 | 1222.00 | 709 | 1.72 |
| 7 | TSTM WIND | 77.00 | 1146.00 | 61122 | 0.02 |
| 8 | WILDFIRE | 75.00 | 911.00 | 2732 | 0.33 |
| 9 | FLASH FLOOD | 539.00 | 517.00 | 36031 | 0.01 |
| 10 | HIGH WIND | 97.00 | 486.00 | 14749 | 0.03 |
Now we will see which type of events have the greatest damage in terms of property and corp damage.
Lets look at the events with the biggest total damage per each occurrence of the event. The damage is expressed in thousands of dollars :
tmp=dplyr::mutate(data, TotalDamage=PROPDMG2+CROPDMG2) %>%
dplyr::group_by(EVTYPE) %>%
dplyr::summarize(PropertyDamage=sum(PROPDMG2, na.rm=T),
CropDamage=sum(CROPDMG2, na.rm=T),
TotalDamage=sum(TotalDamage, na.rm=T),
Events=n()) %>%
dplyr::mutate(TotalDamagePerEvent=(TotalDamage/Events))
tmp=as.data.frame(tmp)
tmp=dplyr::arrange(tmp, -TotalDamagePerEvent)
print(xtable(head(tmp,10)),type = "html", comment=F)
| EVTYPE | PropertyDamage | CropDamage | TotalDamage | Events | TotalDamagePerEvent | |
|---|---|---|---|---|---|---|
| 1 | HURRICANE/TYPHOON | 69305840.00 | 2607872.80 | 71913712.80 | 88 | 817201.28 |
| 2 | STORM SURGE | 43168315.00 | 0.00 | 43168315.00 | 103 | 419109.85 |
| 3 | HURRICANE | 3036855.01 | 448510.00 | 3485365.01 | 40 | 87134.13 |
| 4 | STORM SURGE/TIDE | 4641188.00 | 850.00 | 4642038.00 | 148 | 31365.12 |
| 5 | FLOOD | 133387648.53 | 3591907.40 | 136979555.93 | 18071 | 7580.08 |
| 6 | TROPICAL STORM | 2008360.55 | 410061.00 | 2418421.55 | 584 | 4141.13 |
| 7 | DROUGHT | 845958.00 | 5423625.00 | 6269583.00 | 1845 | 3398.15 |
| 8 | WILDFIRE | 4758667.00 | 295472.80 | 5054139.80 | 2732 | 1849.98 |
| 9 | ICE STORM | 1964717.80 | 8665.00 | 1973382.80 | 1267 | 1557.52 |
| 10 | TORNADO | 18406922.66 | 220589.91 | 18627512.57 | 15167 | 1228.16 |
Now, lets see the events that contribute more to the total damage, regardless the number of events. The damage is expressed in thousands of dollars:
tmp=dplyr::arrange(tmp, -TotalDamage)
print(xtable(head(tmp,10)),type = "html", comment=F)
| EVTYPE | PropertyDamage | CropDamage | TotalDamage | Events | TotalDamagePerEvent | |
|---|---|---|---|---|---|---|
| 1 | FLOOD | 133387648.53 | 3591907.40 | 136979555.93 | 18071 | 7580.08 |
| 2 | HURRICANE/TYPHOON | 69305840.00 | 2607872.80 | 71913712.80 | 88 | 817201.28 |
| 3 | STORM SURGE | 43168315.00 | 0.00 | 43168315.00 | 103 | 419109.85 |
| 4 | TORNADO | 18406922.66 | 220589.91 | 18627512.57 | 15167 | 1228.16 |
| 5 | FLASH FLOOD | 10709402.71 | 812514.00 | 11521916.71 | 36031 | 319.78 |
| 6 | HAIL | 9174277.52 | 1393287.90 | 10567565.42 | 142257 | 74.29 |
| 7 | DROUGHT | 845958.00 | 5423625.00 | 6269583.00 | 1845 | 3398.15 |
| 8 | HIGH WIND | 4833335.91 | 494272.00 | 5327607.91 | 14749 | 361.22 |
| 9 | WILDFIRE | 4758667.00 | 295472.80 | 5054139.80 | 2732 | 1849.98 |
| 10 | STORM SURGE/TIDE | 4641188.00 | 850.00 | 4642038.00 | 148 | 31365.12 |
Here we can see that in total, the type of event that produced more damage from 2002 and the end of 2011 were floods. They damage almost twice than hurricane/typhons. However, the last ones produced a lot more damage if we look at them as the average damage for each occuerrence of the event.
Finally, we will compare the economical damage in thousands of dollars, and the fatalities for a selected group of events:
mainEvents=c("AVALANCHE","DROUGHT","DUST STORM","EXCESSIVE HEAT",
"EXTREME COLD/WIND CHILL","FLASH FLOOD","FLOOD","FOG",
"HAIL","HEAT","HIGH WIND","HURRICANE/TYPHOON","LIGHTNING",
"MARINE STRONG WIND","RIP CURRENT","RIP CURRENTS","STORM SURGE",
"STORM SURGE/TIDE","THUNDERSTORM WIND","TORNADO","TROPICAL STORM","WILDFIRE")
tmp=data[,]
tmp <- melt(tmp[,c("EVTYPE", "PROPDMG2", "CROPDMG2")], id.vars=c("EVTYPE"))
colnames(tmp)[2:3]=c("Type", "T.Dollars")
tmp=dplyr::group_by(tmp, EVTYPE, Type) %>%
dplyr::summarize(T.Dollars=sum(T.Dollars, na.rm=T))
tmp=tmp[tmp$EVTYPE %in% mainEvents,]
g1 = ggplot(tmp, aes(x = EVTYPE, y=T.Dollars, fill=Type)) +
geom_bar(alpha=0.7, stat="identity") +
theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
coord_flip()+
ggtitle("Economic cost per type of event (Thousands U$)")
#Now the human cost:
tmp=data[,]
tmp <- melt(tmp[,c("EVTYPE", "FATALITIES")], id.vars=c("EVTYPE"))
colnames(tmp)[2:3]=c("Type", "T.Dollars")
tmp=dplyr::group_by(data, EVTYPE) %>%
dplyr::summarize(FATALITIES=sum(FATALITIES, na.rm=T),
INJURIES=sum(INJURIES, na.rm=T))
tmp=tmp[tmp$EVTYPE %in% mainEvents,]
g2 = ggplot(tmp, aes(x = EVTYPE, y=FATALITIES)) +
geom_bar(alpha=0.7, stat="identity") +
theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
coord_flip()+
ggtitle("Fatalities per type of event")
g3 = ggplot(tmp, aes(x = EVTYPE, y=INJURIES)) +
geom_bar(alpha=0.7, stat="identity") +
theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
coord_flip()+
ggtitle("Injuries per type of event")
g1 #Economic Cost
g2 #Fatalities
g3 #Injuries